ABSTRACT

A consistency/latency tradeoff arises as soon as a distributed
storage system replicates data. For low latency, modern
storage systems often settle for weak consistency conditions,
which provide little or, even worse, no guarantee of data
consistency. In this paper we propose the notion of almost
strong consistency as a better balance for the
consistency/latency tradeoff. It provides both deterministically
bounded staleness of data versions for each read and
probabilistic quantification of the rate of “reading stale values”,
while achieving low latency. In the context of distributed
storage systems, we investigate almost strong consistency
in terms of 2-atomicity. Our 2AM (2-Atomicity Maintenance)
algorithm completes both reads and writes in one
communication round-trip, and guarantees that each read
obtains a value from among the latest 2 versions. To quantify
the rate of “reading stale values”, we decompose the
so-called “old-new inversion” phenomenon into concurrency
patterns and read-write patterns, and propose a stochastic
queueing model and a timed balls-into-bins model to analyze
them, respectively. The theoretical analysis not only
demonstrates that “old-new inversions” rarely occur, as expected,
but also reveals that the read-write pattern dominates in
guaranteeing such rare data inconsistencies. These findings are
further confirmed by the experimental results, showing that
2-atomicity is “good enough” in distributed storage systems
by achieving low latency, bounded staleness, and rare data
inconsistencies.

1. INTRODUCTION 

Distributed storage systems [13] [15] [12] [10] are considered
integral and fundamental components of modern
Internet services such as e-commerce and social networks.
They are expected to be fast, always available, highly
scalable, and network-partition tolerant. To this end, modern
distributed storage systems typically replicate their data
across different machines and even across multiple data
centers, at the expense of introducing data inconsistency.

More importantly, as soon as a storage system replicates 
data, a tradeoff between consistency and latency arises [2]. 
This consistency/latency tradeoff arguably has been highly 
influential in system design because it exists even when there 
are no network partitions [2]. In distributed storage systems, 
latency is widely regarded as a critical factor for a large 
class of applications. For example, the experiments from
Google [12] demonstrate that increasing web search latency
by 100 to 400 ms reduces the daily number of searches per user
by 0.2% to 0.6%. Therefore, most storage systems (and
applications built on them) are designed for low latency in
the first place. They often sacrifice strong consistency and
settle for weaker models, such as eventual consistency [15],
per-record timeline consistency [12], and causal consistency
[23]. However, such weak consistency models usually provide
little, or even worse, no guarantee for data consistency. More 
specifically, they neither make any deterministic guarantee 
on the staleness of the data returned by reads nor provide 
probabilistic hints on the rate of violations with respect to 
the desired strong consistency. 

In this paper we propose the notion of almost strong
consistency as a better balance for the consistency/latency
tradeoff. The implication of the term “almost” is twofold.
On one hand, it provides deterministically bounded staleness
of data versions for each read. Thus, users can be confident
that out-of-date data is still useful as long as they can
tolerate a certain staleness. On the other hand, it further provides
probabilistic quantification on the low rate of “reading stale 
values”. This ensures that the users are actually accessing 
up-to-date data most of the time. 

We illustrate the idea of almost strong consistency with an
example: a mobile-app-based taxi transportation system. In
this system, each taxi periodically reports its location data
to the data server. Due to the natural locality of the updates
and requests of location data, the city is partitioned into
multiple areas and a data server is deployed in each area.
The location data is replicated among all the data servers.
Thus users all over the city can request the location data
via a mobile application like Uber [3]. Though consistency
is a desirable property, in this application the user may be
more concerned with how long they have to wait before their
query can be served. We argue that the application may be
willing to trade some consistency for low latency, as long as the
inconsistency is bounded and the application can still access
up-to-date data most of the time.

In the context of distributed storage systems, we investigate
almost strong consistency in terms of 2-atomicity.
As an instantiation of the abstract notion of almost strong
consistency, the 2-atomicity semantics also includes two
essential parts, as elaborated below.

First, the 2-atomicity semantics guarantees that the value
returned by each read is one of the latest 2 versions, while
admitting an implementation with low latency. By “low
latency” we mean that both reads and writes complete in
one communication round-trip. Theoretically, it has been
proved impossible to achieve low latency while forcing
each read to return the latest data version, as required in
atomicity, given that a minority of replicas may fail [16].
For example, the ABD algorithm [7] for emulating atomic
registers requires each read to complete in two
round-trips. This impossibility result justifies the relaxed
consistency semantics of 2-atomicity. In the transportation
system example above, the taxi location data can still be
useful if the data returned is no staler than the version
immediately preceding the latest one. This is mainly because
the location data cannot change abruptly and the taxi
frequently updates its location in this scenario.

Second, the 2-atomicity semantics provides probabilistic
quantification of the rate of violations of atomicity. In
data storage systems, atomicity is widely used as the formal
definition of strongly consistent or up-to-date data access.
By bounding the probability of violating atomicity, the
2-atomicity semantics provides another, orthogonal perspective
for expressing how strong consistency is “almost” guaranteed.
In our example above, since the user may request
the location data of a number of taxis, inconsistent
data may not affect the quality of service experienced by
the user, as long as only a small portion of the queries return
slightly stale data.

Our 2AM (2-Atomicity Maintenance) algorithm for maintaining
2-atomicity in distributed storage systems completes
both reads and writes in one communication round-trip, and
guarantees that each read obtains a value from among the
latest 2 versions. To quantify the rate of “reading stale
values”, we decompose the so-called “old-new inversion”
phenomenon into two patterns: the concurrency pattern and
the read-write pattern. We then propose a stochastic queueing
model and a timed balls-into-bins model to analyze the
two patterns, respectively. The theoretical analysis not only
demonstrates that “old-new inversions” rarely occur, as
expected, but also reveals that the read-write pattern
dominates in guaranteeing such rare violations.

We have also implemented a prototype data storage system
among mobile phones, which provides 2-atomic data
access based on the 2AM algorithm and atomic data access
based on the ABD algorithm. The read latency of the
2AM algorithm is significantly lower than that of the ABD
algorithm. More importantly, the experimental results
confirm our theoretical analysis above.
Specifically, the proportion of old-new inversions incurred in
the 2AM algorithm is typically less than 0.1‰, and the
proportion of read-write patterns among concurrency patterns
(e.g., about 0.1‰ in some setting) is much less than that of
concurrency patterns themselves (e.g., more than 50% in the
same setting). Thus 2-atomicity is “good enough” in
distributed storage systems by achieving low latency, bounded
staleness, and rare atomicity violations.

Figure 1: Distributed storage system model.

The remainder of the paper is organized as follows.
Section 2 proposes the notion of almost strong consistency and
discusses how to define it in terms of 2-atomicity in the
context of distributed storage systems. Section 3 presents the
2AM (2-Atomicity Maintenance) algorithm, which achieves
deterministically bounded staleness. Section 4 is concerned
with the theoretical analysis of the atomicity violations
incurred in the 2AM algorithm. Section 5 presents the
prototype data storage system and experimental results.
Section 6 reviews the related work. Section 7 concludes the paper.

2. ALMOST STRONG CONSISTENCY 

In this section, we propose the notion of almost strong 
consistency, and instantiate it in terms of 2-atomicity, in 
the context of distributed storage systems. 

2.1 Generic Notion of Almost Strong Consistency

The distributed storage system consists of an arbitrary
number N of clients and a fixed number n of server replicas
(or replicas, for short) that communicate through message
passing (Figure 1). Each replica maintains a set of replicated
key-value pairs (also referred to as registers in the sequel).

The distributed storage system supports two operations for
upper-layer applications: 1) storing a value associated with
a key, denoted write(key, value); and 2) retrieving a value
associated with a key, denoted value ← read(key). Clients
serve as the proxies for applications by invoking read/write
operations on the registers and communicating with replicas
on their behalf. Because registers are replicated, different
versions of the same register may co-exist. The concept of
consistency models is then introduced to constrain the
data versions that may be returned by each read.
In particular, strong consistency requires each read to obtain
the latest data version according to some sequential order.

The notion of almost strong consistency generalizes
traditional strong consistency by allowing stale data versions
to be read. That is,

1. It provides deterministically bounded staleness of data
versions for each read;

2. It provides probabilistic quantification of the rate of
“reading stale values”.

2.2 Almost Strong Consistency in Terms of 2-Atomicity

In the context of distributed storage systems, we investigate
almost strong consistency in terms of 2-atomicity. As
preliminaries, we first review atomicity [8]. From the view
of clients, each operation is associated with two events: an
invocation event and a response event. For a read (on a
specific key), the invocation is denoted read(key), and its
response has the form ack(value), returning some value to
the client. For a write, the invocation is denoted write(key,
value), and its response is an ack, indicating its completion.
We assume an imaginary global clock, and all the events are
time-stamped with respect to it [21]. Among all the writes,
we posit, for each register, the existence of a special one
which writes the initial value at the very beginning of the
imaginary global clock.

An execution $\sigma$ of the distributed storage system is a
sequence of invocations and responses. An operation $o_1$
precedes another operation $o_2$, denoted $o_1 \prec_\sigma o_2$
(or $o_1 \prec o_2$ if $\sigma$ is clear or irrelevant), if and
only if the response of $o_1$ occurs in $\sigma$ before the
invocation of $o_2$. Two operations are
considered concurrent if neither of them precedes the other.
An execution $\sigma$ is said to be well-formed if each client
invokes at most one operation at a time, that is, for each
client $p_i$, $\sigma|i$ (the subsequence of $\sigma$ restricted
to $p_i$) consists of alternating invocations and matching
responses, beginning with an invocation. A well-formed
execution $\sigma$ is sequential if, for each operation in
$\sigma$, its invocation is immediately followed by its response.

Intuitively, atomicity requires each operation to appear to 
take effect instantaneously at some point between its invo¬ 
cation and its response. More precisely, 

Definition 1. A storage system satisfies atomicity [8] if,
for each of its well-formed executions $\sigma$, there exists a
permutation $\pi$ of all the operations in $\sigma$ such that
$\pi$ is sequential and

• [real-time requirement] If $o_1 \prec_\sigma o_2$, then $o_1$
appears before $o_2$ in $\pi$; and

• [read-from requirement] Each read returns the value
written by the most recently preceding write in $\pi$ on
the same key, if there is one, and otherwise returns its
initial value.

The semantics of 2-atomicity is adapted from that of
atomicity by relaxing its read-from requirement to allow stale
values to be read.

Definition 2. A storage system satisfies 2-atomicity if,
for each of its well-formed executions $\sigma$, there exists a
permutation $\pi$ of all the operations in $\sigma$ such that
$\pi$ is sequential and

• [real-time requirement] If $o_1 \prec_\sigma o_2$, then $o_1$
appears before $o_2$ in $\pi$; and

• [weak read-from requirement] Each read returns the
value written by one of the latest two preceding writes
in $\pi$ on the same key.
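
To make the difference between the two read-from requirements concrete, the following minimal Python sketch checks a given sequential history on a single key. It is only an illustration: the function name, the `("w", value)` / `("r", value)` encoding and the parameter `k` are assumptions introduced here, written values are assumed to be distinct, and the sketch does not search for a permutation or check the real-time requirement.

```python
from typing import List, Tuple

# One operation of a sequential, single-key history (hypothetical encoding):
# ("w", value) for a write, ("r", value) for a read that returned `value`.
Op = Tuple[str, object]


def satisfies_read_from(history: List[Op], initial: object, k: int = 2) -> bool:
    """Return True if every read returns one of the k latest preceding writes.

    k = 1 corresponds to the read-from requirement of atomicity,
    k = 2 to the weak read-from requirement of 2-atomicity.
    """
    written = [initial]  # the special write of the initial value comes first
    for kind, value in history:
        if kind == "w":
            written.append(value)
        elif kind == "r":
            if value not in written[-k:]:
                return False
        else:
            raise ValueError(f"unknown operation kind: {kind!r}")
    return True


history = [("w", 1), ("w", 2), ("r", 1)]
print(satisfies_read_from(history, initial=0, k=2))  # stale by one version
print(satisfies_read_from(history, initial=0, k=1))  # not the latest version
```

In this example the read returns the version just before the latest one, which the weak read-from requirement tolerates and the original read-from requirement rejects.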


In terms of 2-atomicity, the notion of almost strong
consistency can then be instantiated as follows.

1. Besides admitting an implementation with low latency,
it guarantees that each read obtains a value from among
the latest 2 versions;

2. It provides probabilistic quantification of the rate of
actually reading the stale data version.

Sections 3 and 4 are concerned with these two aspects,
respectively.

3. ACHIEVING 2-ATOMICITY 

In this section, we present the 2AM (2-Atomicity
Maintenance) algorithm for emulating 2-atomic, Single-Writer
Multi-Reader (SWMR) registers. It completes both reads
and writes in one round-trip, and guarantees that each read
obtains a value from among the latest 2 versions.

Despite their simplicity, SWMR registers are useful in a wide
range of applications, especially where the shared data has
its natural “owner”. Moreover, multiple SWMR registers
can be used as a group. The typical setting is that each process
has its “own” register, i.e., only the owner process can write
this register, while all processes can read all registers.
Multiple processes can communicate with each other by writing
their own registers and reading other registers. A possible
alternative is to use Multi-Writer Multi-Reader (MWMR, for
short) registers. Compared with using MWMR registers,
using SWMR registers as a group may be more compatible with
the application logic, and the implementation is less complex
and easier to maintain.

3.1 The 2AM (2-Atomicity Maintenance) Algorithm

We use the asynchronous, non-Byzantine model, in which: 
1) Messages can be delayed, lost, or delivered out of order, 
but they are not corrupted; and 2) An arbitrary number 
of clients may crash while only a minority of replicas may 
crash. 

The 2AM algorithm is adapted from the one for
atomicity [7]. It makes use of versioning. Specifically, for each
write(key, value), the writer associates a version with the
key-value pair. Each replica replaces a key-value pair it
currently holds whenever a larger version with the same key is
received. When reading from a key, a client tries to retrieve
the value with the largest version. Since there is only one
writer, versions (for each key) can be totally ordered
using the writer's local sequence numbers.

At its core, the algorithm is stated in terms of majority
quorum systems: each operation is required to
contact a majority of the replicas to proceed. Specifically,

• write(key, value): To write a value on a specific key,
the single writer first generates a larger version than
those it has ever used, associates it with the key-value
pair, sends the versioned key-value pair to all the
replicas, and waits for acknowledgments from a majority of
them.

• read(key): To read from a specific key, the reader first
queries and collects a set of versioned key-value pairs
from a majority of the replicas, from which it chooses
the one with the largest version to return.

Algorithm 1 The 2AM (2-Atomicity Maintenance) algorithm
for 2-atomic, single-writer multi-reader registers.

procedure WRITE(key, value)                ▷ for the writer
    increment version for this key
    pfor each replica s                    ▷ pfor is a parallel for
        send [UPDATE, key, value, version] to s
    wait for [ACK]s from a majority of replicas

procedure READ(key)                        ▷ for each reader
    results ← ∅
    pfor each replica s
        send [QUERY, key] to s
        obtain result ← [k, val, ver] from s
        results ← results ∪ {result}
    until a majority of replicas respond
    return val with the largest ver in results

[k, val, ver]: local versioned key-value pairs

▷ The following procedure is executed in an uninterrupted
way. Assume that msg is from client p_i.

procedure UPON(msg)                        ▷ for each replica
    if msg instanceof [QUERY, key]
        send [k, val, ver] with k = key to client p_i
    end if
    if msg instanceof [UPDATE, key, value, version]
        pick [k, val, ver] with k = key
        if ver < version
            val ← value
            ver ← version
        end if
        send [ACK] to client p_i
    end if



As mentioned before, each replica replaces its key-value 
pair whenever a larger version with the same key from a 
write is received. Besides, it responds to the queries from 
reads with the versioned key-value pair it currently holds. 

The pseudo-code for read and write operations and the
replicas appears in Algorithm 1. Notice that the read here
does not spend a second round-trip propagating the returned
value (along with its version) to a majority of the replicas,
in contrast to that in [7]. The second round-trip in [7] (often
referred to as the “write back” phase) is required to avoid
the “old-new inversion” phenomenon [16] [6]. An “old-new
inversion” witnesses a violation of atomicity, where two
non-overlapping reads, both overlapping a write, obtain
out-of-order values. In the 2AM algorithm, we have intentionally
omitted the “write back” phase. In the following subsection,
we prove that the 2AM algorithm indeed achieves the
emulation of 2-atomic, single-writer multi-reader registers.
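
The one-round-trip read and write procedures can be sketched in Python as follows. This is an in-memory illustration under simplifying assumptions, not the prototype described later: messages are modelled as direct method calls, there are no failures, and the class names, the `delivered_to` parameter and the random choice of the responding majority are all hypothetical devices introduced here.

```python
import random
from typing import Dict, List, Optional, Tuple

Versioned = Tuple[object, int]  # (value, version); version 0 means "never written"


def majority(n: int) -> int:
    """Size of a majority quorum among n replicas."""
    return n // 2 + 1


class Replica:
    def __init__(self) -> None:
        self.store: Dict[str, Versioned] = {}

    def on_update(self, key: str, value: object, version: int) -> str:
        # Replace the local pair only if the incoming version is larger.
        _, ver = self.store.get(key, (None, 0))
        if ver < version:
            self.store[key] = (value, version)
        return "ack"

    def on_query(self, key: str) -> Versioned:
        return self.store.get(key, (None, 0))


class Writer:
    """The single writer; versions are its local sequence numbers per key."""

    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = replicas
        self.version: Dict[str, int] = {}

    def write(self, key: str, value: object,
              delivered_to: Optional[List[Replica]] = None) -> None:
        # `delivered_to` (an assumption of this sketch) models the replicas
        # whose acknowledgments arrive before the write returns.
        self.version[key] = self.version.get(key, 0) + 1
        targets = self.replicas if delivered_to is None else delivered_to
        acks = [s.on_update(key, value, self.version[key]) for s in targets]
        if len(acks) < majority(len(self.replicas)):
            raise RuntimeError("write did not reach a majority of replicas")


def read(replicas: List[Replica], key: str, rng: random.Random) -> object:
    # One round-trip: query, keep the first majority of answers (chosen at
    # random here), return the value with the largest version. No write-back.
    quorum = rng.sample(replicas, majority(len(replicas)))
    results = [s.on_query(key) for s in quorum]
    value, _version = max(results, key=lambda pair: pair[1])
    return value


replicas = [Replica() for _ in range(5)]
writer = Writer(replicas)
writer.write("taxi-42", "location-a")
writer.write("taxi-42", "location-b", delivered_to=replicas[:3])
print(read(replicas, "taxi-42", random.Random(0)))
```

Because any two majorities intersect, a read that starts after a write has completed always sees that write's version in at least one answer, even when the write reached only three of the five replicas.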

3.2 Correctness Proof of the 2AM Algorithm 

We aim to prove that, in the 2AM algorithm, the value 
returned by each read is of one of the latest 2 versions. It 
is basically a case-by-case analysis, concerning the partial 
order among and the semantics of the read/write operations. 

Theorem 1. The 2AM algorithm achieves the emulation 
of 2-atomic, single-writer multi-reader registers. 

Proof. First of all, we notice that 2-atomicity, like
atomicity, is a local property [19]. Therefore, we can prove
the correctness of the 2AM algorithm by reasoning
independently about each individual register accessed in an
execution. Without loss of generality, we assume that all the
operations involved in the following correctness proof are
performed on the same register.

According to the definition of 2-atomicity (Definition 2),
it suffices to identify a permutation $\pi$ of any execution of
the 2AM algorithm, and to prove that $\pi$ is sequential and
satisfies both the “real-time requirement” and the “weak
read-from requirement”.

For any execution $\sigma$, we obtain its permutation $\pi$ in the
following manner:

• All the write operations issued by the single writer are
totally ordered according to the versions they use.

• The read operations are scheduled one by one in order
of their invocation time: A read r that reads from a
write w is scheduled immediately after both w and
all the read operations preceding r in the sense of
$\prec_\sigma$ (which have already been scheduled).

Obviously, this permutation $\pi$ is sequential and satisfies
the “real-time requirement” of 2-atomicity. It remains to
show that it satisfies the “weak read-from requirement” for
each read as well. This argument involves a case-by-case
analysis, concerning the partial order among and the
semantics of the read/write operations.

Here and in the sequel, we use the following notation: For
an operation $o$, let $o_{st}$ denote its start time (i.e., the
time of its invocation event), $o_{ft}$ its finish time (i.e., the
time of its response event), and $[o_{st}, o_{ft}]$ its time
interval (see Figure 2 for an example). We also write
$r = R(w)$ to denote the “read-from” relation in which the
read r reads from the write w.

For any read operation r, we consider two cases according
to whether there are write operations concurrent with it in
the execution $\sigma$.

Case 1: There is no concurrent write with r. According
to the 2AM algorithm (Algorithm 1), especially due to the
mechanism of the majority quorum systems, the read r must
read from its most recently preceding write w, and hence in
$\pi$, it is scheduled between w and the next write.

Case 2: There are concurrent writes with r, among which
the leftmost one is denoted w. Notice that
$r_{st} \in [w_{st}, w_{ft}]$ holds for w. There are two sub-cases
according to the write from which r reads.

Case 2.1: r reads from some concurrent write. In this
case, r is scheduled in $\pi$ between this write and its next
one.

Case 2.2: r reads from its most recently preceding write
in $\sigma$ (denoted w'). Notice that Case 2.1 and Case 2.2 are
exhaustive since r cannot read from any write earlier than
w' due to the mechanism of the majority quorum systems.
To form an “old-new inversion”, there must be at least two
read operations. Therefore, in Case 2.2 (shown in Figure 2),
we now consider other read operations (than r).

Case 2.2.1: There is no read r' that precedes r in $\sigma$ and
is concurrent with w. Formally,
$\nexists r' : r'_{ft} \in [w_{st}, r_{st}]$. In $\pi$, r
is scheduled between w' and its next write (i.e., w).

Case 2.2.2: There is some read r' that precedes r and is
concurrent with w. Formally,
$\exists r' : r'_{ft} \in [w_{st}, r_{st}]$.

Furthermore, if r' reads from w, we obtain an “old-new
inversion”, where two non-overlapping reads (i.e., r and r'),
both overlapping a write (i.e., w), obtain out-of-order
values. The scenario is depicted in Figure 2, where the dotted,
directed arrows denote the read-from relation.

Figure 2: Old-new inversion. Two non-overlapping
reads r and r', both overlapping the write w, obtain
out-of-order values. (Time goes from left to right.)

In this situation, both r and r' are scheduled in $\pi$
between w and its next write. As a consequence, r reads from
w', which is its second most recently preceding write in $\pi$,
meeting the “weak read-from requirement” of 2-atomicity.

Notice that Case 2.2.2 (and thus the “old-new inversion”
phenomenon) is the only case that leads to violations
of atomicity. □

4. QUANTIFYING THE ATOMICITY VIOLATIONS

In this section, we quantify the atomicity violations
incurred in the 2AM algorithm. It follows from the correctness
proof in Section 3.2 that the atomicity violations are
exactly characterized by the “old-new inversions” in Case
2.2.2. Furthermore, the proof has also identified the
necessary and sufficient condition for the “old-new inversion”
phenomenon. We formally define it as follows.

Definition 3. The old-new inversion involving a read r
consists of the read r, two writes w and w', and a second
read r', such that (see Figure 2)

1. $r_{st} \in [w_{st}, w_{ft}]$

2. w' immediately precedes w: $w' \prec w$, and no other
writes are between w and w'

3. $r'_{ft} \in [w_{st}, r_{st}]$

4. $r = R(w')$

5. $r' = R(w)$

The five requirements for “old-new inversion” fall into two
categories. The first three requirements involve the partial
order $\prec$ on, and thus the concurrency patterns among,
read/write operations. Intuitively, the higher the degree of
concurrency an execution shows, the more “old-new inversions”
it may produce.

Definition 4. The concurrency pattern involving a
read r consists of the read r, two writes w and w', and a
second read r', such that

1. $r_{st} \in [w_{st}, w_{ft}]$

2. w' immediately precedes w: $w' \prec w$, and no other
writes are between w and w'

3. $r'_{ft} \in [w_{st}, r_{st}]$

The concurrency pattern itself is not sufficient for an old-new
inversion. Only when the read/write semantics in the last
two requirements of Definition 3 are also satisfied does an
old-new inversion arise. Thus, we define the read-write pattern
conditioning on a concurrency pattern as follows.

Definition 5. Given a concurrency pattern consisting of
r, r', w, and w', exactly as those in Definition 4, the
read-write pattern requires

4. $r = R(w')$

5. $r' = R(w)$
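
The five requirements can be turned into a small predicate over timestamped operations. The Python sketch below is illustrative only: the `Op` record and its field names are assumptions, and requirement 2 is encoded by assuming that the single writer uses consecutive integer versions, so that "no other writes in between" becomes "the versions differ by one".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    start: float    # o_st: time of the invocation event
    finish: float   # o_ft: time of the response event
    version: int    # version written (for a write) or returned (for a read)


def is_concurrency_pattern(r: Op, r_prime: Op, w: Op, w_prev: Op) -> bool:
    """Requirements 1-3: purely about the timing of the four operations."""
    cond1 = w.start <= r.start <= w.finish
    # Assumption: consecutive versions mean no write lies between w' and w.
    cond2 = w_prev.finish < w.start and w_prev.version + 1 == w.version
    cond3 = w.start <= r_prime.finish <= r.start
    return cond1 and cond2 and cond3


def is_read_write_pattern(r: Op, r_prime: Op, w: Op, w_prev: Op) -> bool:
    """Requirements 4-5: r reads from w' while the earlier r' reads from w."""
    return r.version == w_prev.version and r_prime.version == w.version


def is_old_new_inversion(r: Op, r_prime: Op, w: Op, w_prev: Op) -> bool:
    return (is_concurrency_pattern(r, r_prime, w, w_prev)
            and is_read_write_pattern(r, r_prime, w, w_prev))


w_prev = Op(start=0.0, finish=1.0, version=1)
w = Op(start=2.0, finish=10.0, version=2)
r_prime = Op(start=3.0, finish=4.0, version=2)   # earlier read sees the new value
r = Op(start=5.0, finish=6.0, version=1)         # later read sees the old value
print(is_old_new_inversion(r, r_prime, w, w_prev))
```

Keeping the timing test and the read-from test in separate functions mirrors the decomposition into a concurrency pattern and a read-write pattern.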


In this way, an “old-new inversion” occurs if and only
if the read-write pattern arises given that a corresponding
concurrency pattern has emerged. A concurrency pattern
may contain more than one such r' defined in Definition 4,
as illustrated in Figure 2. Let R' be a random variable
denoting the number of such reads r' in a concurrency pattern.
Then, a read-write pattern arises if for some r', Definition 5 is
satisfied. Therefore, the probability of “old-new inversions”
conditioning on R' = m ($m \geq 1$; m can be as large as the
number of all read operations) is the product of the
probability of the concurrency patterns conditioning on R' = m
and the probability of the read-write patterns conditioning
on R' = m. By the law of total probability, we obtain

\[
\begin{aligned}
P\{\text{violation of atomicity}\} &= P\{\mathrm{ONI}\} \\
&= \sum_{m \geq 1} P\{\mathrm{ONI} \mid R' = m\} \\
&= \sum_{m \geq 1} P\{\mathrm{CP} \mid R' = m\} \times P\{\mathrm{RWP} \mid R' = m\}.
\end{aligned}
\tag{4.1}
\]

In the following two subsections, we propose a stochastic
queueing model and a timed balls-into-bins model to analyze
the concurrency pattern and read-write pattern in
Equation (4.1), respectively. The frequently used notations and
formulas are summarized in Table 1.


4.1 Quantifying the Rate of Concurrency Patterns

To quantify the rate of concurrency patterns conditioning
on R' = m, we need an analytical model of the workload
consisting of one sequence of read/write operations from each
client. For each client, the characteristics of its workload
are captured by the rate of operations issued by it and the
service time of each operation (i.e., $[o_{st}, o_{ft}]$). We assume a
Poisson process with parameter $\lambda$ for the former and an
exponential distribution with parameter $\mu$ for the latter.
The scenario of each client issuing a sequence of read/write
operations is then encoded into a queueing model.

We thus consider N independent, parallel M/M/1 queues
(i.e., single-server exponential queueing systems), all with
arrival rate $\lambda$ and service rate $\mu$ [24]. For each M/M/1
queue, we use the “first come first served” discipline and
assume for simplicity that, if there is any operation in service,
no more operations can enter it. The queue $Q_0$ represents
the single writer.

To compute the probability that a concurrency pattern 
occurs in such a queueing system in the long run, we go 
through the following three steps. 

Step 1: What is the stationary distribution for any two 
queues? 

Let $X^i(t)$ be the number of operations in queue $i$ at time
$t$. Then $X^i(t)$ is a continuous-time Markov chain with only
two states: 0 when the queue is empty and 1 when some
operation is being served. Its stationary distribution is:

\[
P_0 \triangleq P(X^i(\infty) = 0) = \frac{\mu}{\lambda + \mu},
\qquad
P_1 \triangleq P(X^i(\infty) = 1) = \frac{\lambda}{\lambda + \mu}.
\]

Let $Y(t) = (X^i(t), X^j(t))$ be the vector of the numbers of
operations in queues $Q_i$ and $Q_j$. Since any two queues are
independent, $Y(t)$ is a continuous-time Markov chain with
four states (0, 0), (0, 1), (1, 0), and (1, 1). Its stationary
distribution is:

\[
P_{0,0} = \frac{\mu}{\mu + \lambda} \cdot \frac{\mu}{\mu + \lambda}
        = \frac{\mu^2}{(\mu + \lambda)^2},
\qquad
P_{0,1} = P_{1,0} = \frac{\mu \lambda}{(\mu + \lambda)^2},
\qquad
P_{1,1} = \frac{\lambda^2}{(\mu + \lambda)^2},
\]

where $P_{i,j} = P(Y(\infty) = (i, j))$, $i, j \in \{0, 1\}$.

Table 1: Notations and formulas.

N: number of clients
n: number of replicas
$q = \lfloor n/2 \rfloor + 1$
Beta function: $B(x, y) = \int_0^1 t^{x-1} (1 - t)^{y-1} \, dt$



Step 2: Given a read r in $Q_i$, what is the probability of
the event, denoted $E$, that it starts during the service
period of some write w in $Q_0$ (formally,
$r_{st} \in [w_{st}, w_{ft}]$ in Definition 4)?

The probability of $E$ equals the probability that when
r arrives at $Q_i$, it finds $Q_i$ empty (denoted $E_i$) and, as a
bystander, finds $Q_0$ full (denoted $E_0$). Since events $E_i$
and $E_0$ are independent, we have

\[
\begin{aligned}
P(E) &= P(E_i \wedge E_0) = P(E_i) \cdot P(E_0) \\
     &= P_0 \cdot P_1 \quad \text{(by the PASTA property [24])} \\
     &= \frac{\mu \lambda}{(\mu + \lambda)^2}.
\end{aligned}
\]
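
As a quick numerical check of Steps 1 and 2, the following Python sketch evaluates the two-state stationary distribution and the resulting probability that a read starts during a write. The function names are hypothetical, and the sample rates (an arrival rate and a service rate of 10 per second each) are those quoted in the caption of Figure 3.

```python
from typing import Dict, Tuple


def stationary(arrival_rate: float, service_rate: float) -> Tuple[float, float]:
    """Stationary probabilities (empty, busy) of one blocking queue.

    The queue is a two-state chain: empty -> busy at the arrival rate,
    busy -> empty at the service rate.
    """
    total = arrival_rate + service_rate
    return service_rate / total, arrival_rate / total


def joint_stationary(arrival_rate: float,
                     service_rate: float) -> Dict[Tuple[int, int], float]:
    """Joint stationary distribution of two independent queues."""
    p = stationary(arrival_rate, service_rate)
    return {(i, j): p[i] * p[j] for i in (0, 1) for j in (0, 1)}


def prob_read_starts_during_write(arrival_rate: float,
                                  service_rate: float) -> float:
    """P(E): the reader's queue is empty while the writer's queue is busy."""
    p_empty, p_busy = stationary(arrival_rate, service_rate)
    return p_empty * p_busy


arrival, service = 10.0, 10.0
p_empty, p_busy = stationary(arrival, service)
print(p_empty, p_busy)
print(joint_stationary(arrival, service))
print(prob_read_starts_during_write(arrival, service))
```

With equal rates each queue is busy half of the time, so a read finds its own queue empty and the writer busy with probability one quarter.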


Step 3: Conditioning on Step 2, what is the probability of
the event, denoted $E_{N-1,m}$, that there are in total m read
operations (denoted r') in the N − 1 queues (besides $Q_0$) which
finish during the time period $[w_{st}, r_{st}]$ (formally,
$r'_{ft} \in [w_{st}, r_{st}]$ in Definition 4)?

First, the length $L = r_{st} - w_{st}$ of the time period
$[w_{st}, r_{st}]$ is exactly the inter-arrival time of $Q_i$,
which is exponential with rate $\lambda$.

The calculations in Appendix 0 yield
$P\{\mathrm{CP} \mid R' = m\} = P(E_{N-1,m})$


N—2 


= £ 



m — 1 
N-k-2 


k N—k — 1 m 
Por s , 


(4.2) 


when $m \geq 1$. For the special case $m = 0$, we have

$P\{\mathrm{CP} \mid R' = 0\} = P(E_{N-1,0}) = P_0^{N-1}$.

Summing over m ($m \geq 1$), we also get the probability that
there exists a concurrency pattern (for some read r):

$P\{\mathrm{CP}\} = 1 - P\{\mathrm{CP} \mid R' = 0\} = 1 - P_0^{N-1}$ (4.3)

4.2 Quantifying the Rate of Read-Write Patterns

Given the concurrency patterns, we further quantify the
rate of read-write patterns conditioning on R' = m:

$r = R(w') \wedge \exists r' : r' = R(w)$,

where r' is among the m read operations in Step 3 in
Section 4.1. To this end, we shall explore in detail the majority
quorum systems used in the 2AM algorithm. We assume
that 1) no node failure or link failure occurs; and 2) to
complete an operation (read or write), the client accesses all the
n replicas and waits for the first $q = \lfloor n/2 \rfloor + 1$
acknowledgments from them. It follows that:

\[
\begin{aligned}
& P\{\mathrm{RWP} \mid R' = m\} \\
&= P\{r = R(w') \wedge \exists r' : r' = R(w)\} \\
&\leq P\{r \neq R(w) \wedge \exists r' : r' = R(w)\} \\
&= P\{r \neq R(w)\} \times P\{\exists r' : r' = R(w) \mid r \neq R(w)\} \\
&= P\{r \neq R(w)\} \times \left(1 - P\{r' \neq R(w) \mid r \neq R(w)\}^{m}\right)
\end{aligned}
\tag{4.4}
\]

where $r \neq R(w)$ (resp. $r' \neq R(w)$) denotes that r (resp.
r') does not read from w. The inequality is due to the fact
that $r = R(w')$ implies $r \neq R(w)$. We then focus on the
calculations of $P\{r \neq R(w)\}$ and
$P\{r' \neq R(w) \mid r \neq R(w)\}$.
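
The upper bound on the read-write pattern and its use in the sum over m can be evaluated as in the Python sketch below. The function names are hypothetical, and the numeric inputs are arbitrary placeholders chosen only to exercise the formulas: they are not values from the analysis or the experiments, and the per-m concurrency-pattern probabilities must be supplied by the caller.

```python
from typing import Dict


def rwp_upper_bound(p_r_misses_w: float,
                    p_rprime_misses_w: float,
                    m: int) -> float:
    """Upper bound on P{RWP | R' = m}.

    p_r_misses_w      stands for P{r != R(w)}
    p_rprime_misses_w stands for P{r' != R(w) | r != R(w)}
    m                 is the number of candidate reads r'
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    return p_r_misses_w * (1.0 - p_rprime_misses_w ** m)


def oni_upper_bound(cp_by_m: Dict[int, float],
                    p_r_misses_w: float,
                    p_rprime_misses_w: float) -> float:
    """Sum over m of P{CP | R' = m} times the bound on P{RWP | R' = m}."""
    return sum(cp * rwp_upper_bound(p_r_misses_w, p_rprime_misses_w, m)
               for m, cp in cp_by_m.items())


# Placeholder inputs, for illustration only.
print(rwp_upper_bound(0.5, 0.5, m=1))
print(rwp_upper_bound(0.5, 0.5, m=2))
print(oni_upper_bound({1: 0.1, 2: 0.2}, 0.5, 0.5))
```

The bound grows with m because each additional read r' is one more chance that some earlier read has already observed w.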

Which write a read reads from depends
on the states of the replicas from which it collects the first
$\lfloor n/2 \rfloor + 1$ acknowledgments. The states of the
replicas further depend on the timing issues in the 2AM
algorithm, such as message delays and the time lag between the
events that the messages are sent. Taking into account the
timing issues, we propose the timed balls-into-bins model for
the read and write procedures in the 2AM algorithm. Let $D_r$
(resp. $D_w$) be a (non-negative) continuous random variable
denoting the message delay for read (resp. write) operations
during a communication round-trip. Let T be a (non-negative)
continuous random variable denoting the time lag between
the times when two messages of interest are sent, and t a
realization (also called an observed value) of T.

Figure 3: The probability of concurrency patterns: a) with vs.
without concurrency patterns; b) conditioning on R' = m
($\lambda = 10\,\mathrm{s}^{-1}$, $\mu = 10\,\mathrm{s}^{-1}$).
Panel titles: a) Probability of concurrency patterns;
b) Conditional probability of concurrency patterns.
Axis label: Number of read operations r' (i.e., R' = m).



In the timed balls-into-bins model, there are n bins
(corresponding to n replicas). Consider two robots $R_1$ and $R_2$
(corresponding to read or write operations) which can
produce multiple balls (corresponding to messages)
instantaneously. At time 0, the following happens for robot $R_1$:
1) it produces n balls instantaneously; 2) these n balls are
immediately and independently sent to the n bins, one ball per
bin; and 3) the delays for the balls going from the robot to
their destination bins are independent and identically
distributed with the same distribution as $D_r$ or $D_w$ as
defined above, depending on whether the robot represents a
read or a write.

At time t (defined above), robot $R_2$ independently does
exactly the same thing as robot $R_1$ does (i.e., 1), 2), and 3)
for robot $R_1$ above).

Each probability to calculate is related to an event in an 
instantiation of the timed balls-into-bins model. 

To calculate $P\{r \ne R(w)\}$, we are concerned with the
model in which the robots $R_1$ and $R_2$ represent the write
operation w and the read operation r involved in an
"old-new inversion", respectively. Furthermore, we assume that
the random variable $D_r$ (resp. $D_w$) for time delay is
exponentially distributed with rate $\lambda_r$ (resp. $\lambda_w$).
The time lag T between the events that w and r are issued
(meanwhile messages are sent to replicas) corresponds to the
time period $[w_{st}, r_{st}]$. That is to say, T is an exponential
random variable with rate $\lambda$, as shown in Section 4.1
(see Step 3). For simplicity, we take the time lag t to be the
expectation of T, i.e., $t = 1/\lambda$. Finally, we are interested
in the time point t' when exactly $q = \lfloor n/2 \rfloor + 1$ of
the n bins have received the balls from $R_2$ (i.e., r), and
denote the set of these q bins by B. In terms of the timed
balls-into-bins model, the case of $r \ne R(w)$ corresponds to
the event E that none of the q bins in B receives a ball from
$R_1$ (i.e., w) before it receives a ball from $R_2$ (i.e., r).
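The event E described above can also be estimated by direct simulation. The following is a minimal Monte Carlo sketch in Python, assuming exponentially distributed delays as in the text; the function name `estimate_miss_probability`, its parameters, the trial count, and the seed are illustrative choices and not part of the 2AM algorithm or of the analysis itself.

```python
import random


def estimate_miss_probability(n, t, rate_w, rate_r, trials=20000, seed=1):
    """Monte Carlo estimate of P(E) in the timed balls-into-bins model.

    n      -- number of bins (replicas)
    t      -- time at which robot R2 (the read) sends its balls
    rate_w -- rate of the exponential delay of the write's messages
    rate_r -- rate of the exponential delay of the read's messages
    """
    rng = random.Random(seed)
    q = n // 2 + 1  # majority quorum size
    hits = 0
    for _ in range(trials):
        # Robot R1 (the write) sends at time 0, robot R2 (the read) at time t.
        write_arrival = [rng.expovariate(rate_w) for _ in range(n)]
        read_arrival = [t + rng.expovariate(rate_r) for _ in range(n)]
        # B: the q bins that receive the read's ball first.
        quorum = sorted(range(n), key=lambda i: read_arrival[i])[:q]
        # Event E: in every bin of B, the read's ball arrives before the write's.
        if all(write_arrival[i] > read_arrival[i] for i in quorum):
            hits += 1
    return hits / trials
```

With t = 0 and equal rates, the estimate for n = 3 should come out close to 0.4, which is what the closed form derived in Appendix B.1 gives for that setting.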

The calculations in Appendix B.1 yield

$$P\{r \ne R(w)\} = \frac{e^{-q \lambda_w t}\, \alpha^{q}\, B(q, \alpha(n - q) + 1)}{B(q, n - q + 1)} \qquad (4.5)$$

where $\alpha = \frac{\lambda_r}{\lambda_r + \lambda_w}$ and B denotes the Beta function.
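Equation (4.5) is straightforward to evaluate numerically. The sketch below is one possible way to do so in Python, working in log space with `math.lgamma` for the Beta function and taking t = 1/λ (the expectation of T) as in the text. The function names `log_beta` and `miss_probability` are hypothetical.

```python
import math


def log_beta(a, b):
    """Natural logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def miss_probability(n, lam, rate_w, rate_r):
    """Evaluate Equation (4.5) with the time lag fixed to t = 1 / lam."""
    q = n // 2 + 1                      # majority quorum size
    t = 1.0 / lam                       # expected time lag between w and r
    alpha = rate_r / (rate_r + rate_w)
    log_p = (-q * rate_w * t
             + q * math.log(alpha)
             + log_beta(q, alpha * (n - q) + 1)
             - log_beta(q, n - q + 1))
    return math.exp(log_p)
```

For λ = 10 and λ_r = λ_w = 20 (per second), `miss_probability(2, 10, 20, 20)` and `miss_probability(3, 10, 20, 20)` agree with the Table 2 entries 0.00457891 and 0.00732626.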

To calculate $P\{r' \ne R(w) \mid r \ne R(w)\}$, we introduce a
slightly generalized timed balls-into-bins model. In the new
model, robot $R_2$ picks p (0 < p < n) bins uniformly at
random (without replacement) and sends a ball to each of
them (see Appendix B.2). The calculations in Appendix B.3
yield

$$P\{r' \ne R(w) \mid r \ne R(w)\} =
\begin{cases}
\dfrac{J_1}{B(q, n - q + 1)} & \text{if } n > 2, \\[1ex]
0 & \text{if } n = 2.
\end{cases}
\qquad (4.6)$$


Substituting Equations (4.5) and (4.6) into Equation (4.4)
gives, for n > 2, the rate of read-write patterns conditioning
on R' = m:

$$
\begin{aligned}
P\{RWP \mid R' = m\}
&\le P\{r \ne R(w)\} \times \left(1 - P\{r' \ne R(w) \mid r \ne R(w)\}^{m}\right) \\
&\le \frac{e^{-q \lambda_w t}\, \alpha^{q}\, B(q, \alpha(n - q) + 1)}{B(q, n - q + 1)}
\times \left(1 - \left(\frac{J_1}{B(q, n - q + 1)}\right)^{m}\right) \qquad (4.7)
\end{aligned}
$$

For n = 2, we have

$$P\{RWP \mid R' = m\} = 0.$$

Notice that $P\{RWP \mid R' = 0\} = 0$ since there are no
concurrency patterns at all.
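As a small worked example of this bound, the sketch below plugs the n = 3 entries of Table 2 into the product form of Equation (4.4). The function name `rwp_upper_bound` is hypothetical, and the two input probabilities are simply read off the table rather than recomputed.

```python
def rwp_upper_bound(p_r_miss, p_rprime_miss, m):
    """Bound on P{RWP | R' = m}: P{r != R(w)} * (1 - P{r' != R(w) | r != R(w)}^m)."""
    return p_r_miss * (1.0 - p_rprime_miss ** m)


# Values taken from the n = 3 row of Table 2.
p_r_miss = 0.00732626            # P{r != R(w)}
p_rprime_miss = 1.0 - 0.0409628  # P{r' != R(w) | r != R(w)}

# Bounds for m = 1 and m = 2 (the range m = 1 .. N - 1 with N = n = 3).
bounds = [rwp_upper_bound(p_r_miss, p_rprime_miss, m) for m in (1, 2)]
total = sum(bounds)
```

Summing the bound over m = 1, 2 gives about 0.00088802, which matches the P{RWP | CP} entry for 3 replicas in Table 3. For m = 0 the expression is 0, in line with the remark above.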

4.3 Numerical Results and Discussions 

In light of the complicated analytical formulation, we present
the numerical results on concurrency patterns, read-write
patterns, and old-new inversions. The numerical results
have not only demonstrated that "old-new inversions"
(and thus, atomicity violations) rarely occur as expected,
but also clearly revealed that the read-write patterns
dominate in guaranteeing such rare violations.

Figure 3 presents the probability of concurrency patterns,
given $\lambda = 10\,\mathrm{s}^{-1}$ and $\mu = 10\,\mathrm{s}^{-1}$,
meaning that the expected arrival rate is 10 operations per second
and the expected service time is 100 ms. First of all, Figure 3(a)
shows that the probability of concurrency patterns is quite high,
and it rapidly increases with the number of clients. For example,
when N = 15, it nearly reaches 1: intuitively, for each read
r, there almost always exist concurrency patterns involving
it. Figure 3(b) further explores the probability of concurrency
patterns conditioning on the number m of reads r'
(i.e., $P\{CP \mid R' = m\}$). Here m = 0 indicates that there
are no concurrency patterns at all, corresponding to the
(square-marked) line at the bottom in Figure 3(b). One key
observation from Figure 3(b) is that the conditional probability
of concurrency patterns concentrates on the small
values of m, and for each N the value of m which achieves
the maximum is smaller than N. This observation partly
justifies the assumption made in the model for calculating
$P\{r' \ne R(w) \mid r \ne R(w)\}$ (see Appendix B.2) that there is
at most one such r' in a single (client) process.

Figure 4: The probability of read-write patterns
($\lambda = 10\,\mathrm{s}^{-1}$, $\mu = 10\,\mathrm{s}^{-1}$, $\lambda_r = \lambda_w = 20\,\mathrm{s}^{-1}$).

Figure 5: The probabilities of concurrency patterns,
read-write patterns, and old-new inversions
($\lambda = 10\,\mathrm{s}^{-1}$, $\mu = 10\,\mathrm{s}^{-1}$, $\lambda_r = \lambda_w = 20\,\mathrm{s}^{-1}$, $N = n$).

Figure 4 as well as Table 2 presents the probability of
read-write patterns, given $\lambda = 10\,\mathrm{s}^{-1}$,
$\mu = 10\,\mathrm{s}^{-1}$, and $\lambda_r = \lambda_w = 20\,\mathrm{s}^{-1}$.
The latter two parameters mean that the expected message delay
is 50 ms. According to Equation (4.7), we distinguish the
probability $P\{r \ne R(w)\}$ from another one,
$1 - P\{r' \ne R(w) \mid r \ne R(w)\}^{m}$ (in Figure 4 we take the
extreme value of m = 1), and observe that the former dominates
in keeping the probability of read-write patterns quite
low. The observation that $P\{r \ne R(w)\}$ is quite low has
demonstrated the effectiveness of the majority quorum system
used in the 2AM algorithm, under which a read would,
with a high probability, not miss a concurrent write that
starts earlier. In addition, if a read r has happened to miss
such a concurrent write, it is still quite likely to avoid an
old-new inversion: r can reasonably infer, from the low values
of $1 - P\{r' \ne R(w) \mid r \ne R(w)\}$, that the preceding
reads r' would not have read from that write either.

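Substituting Equations (4.2) and (4.7) into Equation
(4.1), we obtain the rate of violating atomicity:

$$
\begin{aligned}
P\{\text{violation of atomicity}\} &= P\{ONI\} \\
&= \sum_{m \ge 1} P\{CP \mid R' = m\} \times P\{RWP \mid R' = m\} \\
&\le \sum_{m \ge 1} \left( \sum_{k=0}^{N-2} \binom{N-1}{k} \binom{m-1}{N-k-2}\, p_0^{k}\, r^{N-k-1}\, s^{m} \right)
\times \frac{e^{-q \lambda_w t}\, \alpha^{q}\, B(q, \alpha(n-q)+1)}{B(q, n-q+1)}
\times \left( 1 - \left( \frac{J_1}{B(q, n-q+1)} \right)^{m} \right) \qquad (4.8)
\end{aligned}
$$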


Notice that Equation (4.8) is an approximation since the
timed balls-into-bins model used for calculating the
probability of read-write patterns (specifically, for the case of
$P\{r' \ne R(w) \mid r \ne R(w)\}$ in Appendix B.2) assumes that
there is at most one such r' in a single process, while the
model for calculating the probability of concurrency
patterns does not.


Figure 5 as well as Table 3 presents the probability of
old-new inversions according to Equation (4.8) with N = n.
We also list the probabilities of concurrency patterns and
read-write patterns, which are calculated as follows:

$$P\{CP\} = \sum_{m=1}^{N-1} P\{CP \mid R' = m\},$$

$$P\{RWP \mid CP\} = \sum_{m=1}^{N-1} P\{RWP \mid R' = m\}.$$

Based on Figure 5 and Table 3, we first observe that the
probability of old-new inversions (and thus, atomicity
violations) is sufficiently small, demonstrating that 2-atomicity
and the 2AM algorithm are "good enough" in distributed
storage systems. More importantly, it also reveals that the
read-write patterns dominate in guaranteeing such rare
violations, compared to the concurrency patterns which occur
quite often.

Notice that the principles underlying our theoretical
analysis (as well as the numerical analysis) have been decoupled
from the assumptions we adopt about the networks
and workloads. These principles mainly consist of the
introduction of old-new inversion, the decomposition of it into
concurrency pattern and read-write pattern, the queueing
model for analyzing concurrency patterns, and the timed
balls-into-bins model for analyzing read-write patterns.
Network conditions and workload types may vary in different
scenarios. However, the principles and the methodology of
our analysis still apply.


5. EXPERIMENTS AND EVALUATIONS 

In this section, we empirically study the 2AM algorithm.
To this end, we have implemented a prototype data storage
system among mobile phones, which provides 2-atomic
data access based on the 2AM algorithm and atomic data
access based on the ABD algorithm. We compare the read
latency in both algorithms. We also measure the proportion
of atomicity violations incurred in the 2AM algorithm.

5.1 Experimental Design 

Our prototype system comprises a collection of Google
Nexus5 smartphones (CPU: Qualcomm Snapdragon™ 800,
2.26GHz, Memory: 16GB, Android: 4.4.2), equipped with
72Mbps wireless LAN. In both algorithms, each phone acts
as both a client and a server replica. As a client, it collects
its own execution trace for offline analysis. Clocks on the
phones are synchronized with the same desktop computer.

Table 2: Numerical results on the probabilities $P\{r \ne R(w)\}$ and $1 - P\{r' \ne R(w) \mid r \ne R(w)\}$.

# replicas    P{r ≠ R(w)}        1 - P{r' ≠ R(w) | r ≠ R(w)}
2             0.00457891         1.0
3             0.00732626         0.0409628
4             0.000566572        0.0561367
5             0.00077461         0.0356626
6             0.0000628992       0.0511399
7             0.0000813243       0.0294467
8             6.77295 × 10^-6    0.0426608
9             8.51249 × 10^-6    0.0243758
10            7.20025 × 10^-7    0.0353241
11            8.89660 × 10^-7    0.0203645
12            7.60436 × 10^-8    0.0294186
13            9.28973 × 10^-8    0.0171705
14            8.00055 × 10^-9    0.0246974
15            9.69478 × 10^-9    0.0145951

Table 3: Numerical results on the probabilities of concurrency patterns, read-write patterns, and old-new
inversions.

# replicas    P{CP}       P{RWP | CP}        P{ONI}
2             0.28125     0.                 0.
3             0.518555    0.00088802         0.000203683
4             0.677307    0.000183791        0.0000352958
5             0.781222    0.000266569        0.0000437181
6             0.849318    0.0000450835       6.49226 × 10^-6
7             0.89429     0.0000478926       6.08721 × 10^-6
8             0.924335    7.43561 × 10^-6    8.53810 × 10^-7
9             0.9447      7.06025 × 10^-6    7.30744 × 10^-7
10            0.95874     1.04312 × 10^-6    9.93356 × 10^-8
11            0.968604    9.37995 × 10^-7    8.16935 × 10^-8
12            0.975675    1.34085 × 10^-7    1.08822 × 10^-8
13            0.98085     1.16911 × 10^-7    8.77158 × 10^-9
14            0.984717    1.63195 × 10^-8    1.15178 × 10^-9
15            0.987662    1.39573 × 10^-8    9.18283 × 10^-10

We explore three kinds of parameters: 1) algorithm
parameters: replication factor (i.e., the number of phones) and
consistency levels (i.e., atomicity or 2-atomicity); 2) workload
parameters: the number of read/write operations issued
by each client and the issue rate on each client; and 3)
network parameter: the injected random delay in network
communication, modeling the various degrees of asynchrony.

We are concerned with two metrics: 

Latency: We compare the read latency in both algorithms
by varying the replication factors and the issue rates of
operations in the workload. Each client issues reads/writes at a
Poisson rate $\lambda$ (= 5, 10, 20, 50, 100, or 200) per second. For
each $\lambda$, the replication factors vary from 2 to 5. Each reader
issues 50,000 read operations. The single writer issues only
write operations. In addition, the size of the keyspace is
fixed to 1. The key takes integer values from 0 to 4.

Violations of atomicity: We quantify the violations of
atomicity incurred in the 2AM algorithm by varying the
replication factors and the network delays. The replication
factors vary from 2 to 5. For each replication factor, the
injected random delays in network communication are
uniformly distributed over integers in [0, r) (r can be 10, 20,
50, 100, and 200 ms). Each client issues 200,000 operations.
The single writer issues only write operations. On
each client, operations arrive at a Poisson rate of 50 per second
so that the system operates at its full capacity. The
size of the keyspace is 1 and the "hotspot" key takes integer
values from 0 to 4.
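The workload and delay parameters above can be mimicked in a few lines. The following Python sketch is only an illustration of the two random ingredients (Poisson arrivals and uniformly distributed integer delays); the helper names `poisson_arrival_times` and `injected_delay_ms` are hypothetical and do not come from the prototype.

```python
import random


def poisson_arrival_times(rate_per_s, count, rng):
    """Arrival times (in seconds) of `count` operations of a Poisson process."""
    times = []
    now = 0.0
    for _ in range(count):
        now += rng.expovariate(rate_per_s)  # exponential inter-arrival gap
        times.append(now)
    return times


def injected_delay_ms(bound_ms, rng):
    """Integer delay drawn uniformly from [0, bound_ms)."""
    return rng.randrange(bound_ms)


rng = random.Random(42)
arrivals = poisson_arrival_times(50, 1000, rng)              # 50 operations per second
delays = [injected_delay_ms(50, rng) for _ in range(1000)]   # r = 50 ms
```

Here the rate of 50 per second and the bound r = 50 ms correspond to one of the settings listed above; the other settings only change the two arguments.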


5.2 Experimental Result 1: Latency

We visualize the latency data using box plots (Figure 6),
where the box indicates the median and the 25th and 75th
percentile scores, while the whiskers indicate variability
outside the lower and upper quartiles. The medians are marked
by the white lines between boxes. The outliers (probably
due to the garbage collection in phones) are not shown.

Figure 6: Comparison of read latency in both the ABD atomicity algorithm and the 2AM algorithm.

As indicated in Figure 6, the read latency is significantly
reduced using the 2AM algorithm, which completes each read
in one round-trip. In the case of 5 replicas, the reduction of
the latency is about 29%.

Figure 6 also shows that the issue rate of operations on
each client has little impact on the read latency. This is due
to the fact that in both algorithms, reads or writes proceed
independently, especially without waiting for each other (the
cases of "rate = 5", "rate = 20", and "rate = 100" are thus
not shown). On the other hand, the more replicas are
involved, the higher the read latency. This is
because each read needs to contact all the replicas and wait
for acknowledgments from a majority of them.

5.3 Experimental Result 2: Atomicity Violations

To measure the proportion of the atomicity violations
incurred in the 2AM algorithm, we count the number of read
operations (#R) and the occurrences of concurrency patterns
(#CP) and read-write patterns (#RWP). Because
each concurrency pattern or each read-write pattern is
associated with some read operation r, we are concerned with
the following quantities:

$$P(CP) = \frac{\#CP}{\#R}, \qquad P(RWP \mid CP) = \frac{\#RWP}{\#CP}, \qquad P(ONI) = \frac{\#RWP}{\#R}.$$

Figure 7: The proportions of concurrency patterns P(CP), read-write patterns among concurrency patterns
P(RWP|CP), and "old-new inversions" P(ONI). (log 0 is not defined and thus not shown.)

In this manner, the proportion of "old-new inversions"
(and thus the violations of atomicity) P(ONI) equals the
product of P(CP) and P(RWP|CP):

$$P(ONI) = P(CP) \cdot P(RWP \mid CP). \qquad (5.1)$$


Notice that this is not the case in theory, according to
Equation (4.1). Therefore, Equation (5.1) is a practical
approximation to Equation (4.1) in theory, without going into the
details of conditioning on R' = m. The feasibility of such an
approximation will be justified by the experimental results
presented shortly, in the sense that the key observations
drawn from the numerical results based on the equations in
theory fit well with the empirical data and Equation (5.1).

Due to the limited space, Tables 4 and 5 summarize part
of the experimental results (also shown in Figure 7). In
Table 4, the replication factor is 5 (thus the number of read
operations is 800,000) and the parameter of async varies
from 10 ms to 200 ms. In Table 5, the parameter of async is
50 ms and the replication factors vary from 2 to 5. As shown
in Table 4, the higher the degree of asynchrony is, the more
concurrency patterns there are. On the other hand, the
number of occurrences of concurrency patterns grows as the
replication factor increases (Table 5). Accordingly, the
proportion of concurrency patterns P(CP) also tends to increase
along with the replication factor (from about 0.33 at 2 replicas
to about 0.54 at 5 replicas in Table 5, with a slight dip at
4 replicas), as implied by Equation (4.3).
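The three proportions reported in Tables 4 and 5 follow directly from the raw counts. The short Python sketch below assumes they are the ratios #CP/#R, #RWP/#CP, and #RWP/#R (which reproduces the first row of Table 4); the function name `proportions` is hypothetical.

```python
def proportions(num_reads, num_cp, num_rwp):
    """Return (P(CP), P(RWP|CP), P(ONI)) computed from raw counts."""
    p_cp = num_cp / num_reads
    # With no concurrency pattern there can be no read-write pattern either.
    p_rwp_given_cp = num_rwp / num_cp if num_cp else 0.0
    p_oni = num_rwp / num_reads
    return p_cp, p_rwp_given_cp, p_oni


# First row of Table 4: async = 10 ms, replication factor = 5.
p_cp, p_rwp_given_cp, p_oni = proportions(800_000, 269_061, 47)
```

Under this reading, P(ONI) = P(CP) · P(RWP|CP) holds as an identity on the counts, which is why the product form needs no further conditioning on R' = m in the experiments.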


For the number of read-write patterns, the experimental
results exhibit three features. First, no read-write patterns
(and thus no "old-new inversions") arise with only 2 replicas.
This is because both read and write operations are required
to contact both replicas to complete. Second, there are fewer
read-write patterns in the case of 4 replicas than in
the case of 3 or 5 replicas. In the case of 4 replicas, each
read contacts 3 replicas according to the mechanism of the
majority quorum system, accounting for 75% of them, and
gains more opportunities to obtain the latest data version.
For 3 or 5 replicas, the majorities account for 66.7% and
60%, respectively. (Notice that the majority accounts for
100% in the case of 2 replicas.) Third, Table 4 shows that the
degree of asynchrony also contributes to the occurrences of
read-write patterns since it may lead to out-of-order message
delivery in the timed balls-into-bins model (Section 4.2).
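The quorum fractions quoted above follow from the majority size q = ⌊n/2⌋ + 1. A minimal Python sketch (the function name `majority_quorum` is hypothetical) makes the arithmetic explicit:

```python
def majority_quorum(n):
    """Return (q, q / n) for a majority quorum over n replicas."""
    q = n // 2 + 1
    return q, q / n


# Replication factors used in the experiments.
fractions = {n: majority_quorum(n) for n in range(2, 6)}
```

This gives quorum sizes 2, 2, 3, 3 for n = 2, 3, 4, 5, that is, fractions of 100%, about 66.7%, 75%, and 60% of the replicas.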

One of the most important observations concerning these
experiments is that they have confirmed our theoretical
analysis in Section 4.3. First, the proportion of "old-new
inversions" P(ONI) is quite small (less than 0.1‰ in most
executions), demonstrating that 2-atomicity is "good enough" in
data storage systems regarding the violations of atomicity.
More importantly, the proportion of read-write patterns among
concurrency patterns P(RWP|CP) is much less than
that of concurrency patterns P(CP) themselves. Namely,
although concurrency patterns appear frequently (e.g.,
accounting for more than 50% in the setting of 5 replicas and
50 ms async), only a quite small portion of them satisfy the
read-write semantics of read-write pattern (Definition 5) to
constitute the "old-new inversions" (e.g., about 0.1‰ in the
same setting). It follows that the read-write patterns
dominate in guaranteeing such rare atomicity violations incurred
in the 2AM algorithm.

In conclusion, the experimental results (which have
confirmed the theoretical analysis) show that 2-atomicity and
the 2AM algorithm are "good enough" in distributed storage
systems, by achieving low latency, bounded staleness,
and rare atomicity violations.

Table 4: The numbers and proportions of concurrency patterns and read-write patterns (replication factor = 5).

async (ms)   # read operations   # concurrency patterns   # read-write patterns   P(CP)      P(RWP|CP)      P(ONI)
10           800,000             269,061                  47                      0.336326   0.000174682    0.00005875
20           800,000             306,274                  44                      0.382843   0.000143662    0.000055
50           800,000             428,344                  44                      0.53543    0.000102721    0.000055
100          800,000             549,102                  83                      0.686378   0.000151156    0.00010375
200          800,000             627,814                  100                     0.784768   0.000159283    0.000125

Table 5: The numbers and proportions of concurrency patterns and read-write patterns (async = 50 ms).

# replicas   # read operations   # concurrency patterns   # read-write patterns   P(CP)      P(RWP|CP)      P(ONI)
2            200,000             66,985                   0                       0.334925   0              0
3            400,000             192,902                  83                      0.482255   0.00043027     0.0002075
4            600,000             280,091                  6                       0.466818   0.0000214216   0.00001
5            800,000             428,344                  44                      0.53543    0.000102721    0.000055

6. RELATED WORK 

We divide the related work into three categories:
consistency/latency tradeoff, complexity of emulating atomic
registers, and quantifying weak consistency.

Consistency/latency tradeoff. Designing distributed storage
systems involves a range of tradeoffs among, for instance,
consistency, latency, availability, and fault-tolerance. The
well-known CAP theorem [11] states that it is impossible for
any distributed data storage system to achieve consistency,
availability, and network-partition tolerance simultaneously.
More recently, another tradeoff, between consistency and
latency, has been considered more influential on the
designs of distributed storage systems, as it is present at all
times during system operation [2].

In this paper we study the consistency/latency tradeoff
and propose the notion of almost strong consistency as a
better balance option for it.

Complexity of emulating atomic registers. The ABD algorithm
for atomicity emulates the atomic, single-writer
multi-reader registers in unreliable, asynchronous networks,
given that a minority of nodes may fail. It requires each read
to complete in two round-trips. Dutta et al. [16] proved that
it is impossible to obtain a fast emulation, where both reads
and writes complete in one round-trip (i.e., low latency in
our terms). Georgiou et al. [17] studied the semi-fast emulations
(of atomic, single-writer multi-reader registers) where
most reads complete in one round-trip. Guerraoui et al.
considered the best-case complexity, assuming synchrony,
no or few failures, and absence of read/write contention. In
this situation, fast emulations do exist [18].

We investigate the notion of almost strong consistency in
terms of 2-atomicity, namely, to emulate 2-atomic, single-writer
multi-reader registers. Our 2AM algorithm completes
both reads and writes in one round-trip.

Quantifying weak consistency. Weak consistency can be
quantified from four perspectives: data versions, randomness,
timeliness, and numerical values. Modern distributed
storage systems often settle for weak consistency and allow
reads to obtain data of stale versions [15, 14]. The semantics
of k-atomicity [5] guarantees that the data returned is of
a bounded staleness. Without guarantee of bounded staleness,
random registers [22] provide a probability distribution
over the set of out-of-date values that may be returned. Using
PBS (Probabilistically Bounded Staleness) [9], one can
obtain the probability of reading one of the latest k versions
of a data item. Timed consistency models [25] require writes
to be globally visible within a period of time. PBS [9] also
calculates the probability of reading a write t seconds after
it returns. TACT [26], a continuous consistency model,
integrates the metric on numerical error with staleness.

The 2-atomicity (and almost strong consistency) semantics
integrates bounded staleness of versions with randomness.
Our 2AM algorithm completes each read in one round-trip,
in contrast to that of k-atomicity [5]. It differs from
random registers [22] and PBS [9] in two aspects: First, it
provides guarantee of deterministically bounded staleness.
Second, the rate of violations is quantified with respect to
atomicity instead of regularity (as in [22] and [9]), which is
more challenging since we shall deal with concurrent
operations. To do this, we propose a stochastic queueing model
for analyzing the concurrency pattern first and then a timed
balls-into-bins model for analyzing the read-write pattern.


7. CONCLUSION AND FUTURE WORK 

In this paper we propose the notion of almost strong
consistency as a better balance option for the consistency/latency
tradeoff. It provides both deterministically bounded staleness
of data versions for each read and probabilistic
quantification on the rate of "reading stale values", while
achieving low latency. In the context of distributed storage
systems, we investigate almost strong consistency in terms of
2-atomicity. Our 2AM (2-Atomicity Maintenance) algorithm
completes both reads and writes in one communication
round-trip, and guarantees that each read obtains a
value within the latest 2 versions. We also quantify the
rate of atomicity violations incurred in the 2AM algorithm,
both analytically and experimentally.

We identify three problems for future work. First, it is
worthwhile to conduct more intensive simulations or
experiments, in order to reveal the key parameters and guiding
principles for distributed storage system design. Second, we
plan to study 2-atomic, multi-writer multi-reader registers.
One key problem is whether they admit implementations
which complete both reads and writes in one round-trip.
Finally, we hope to extend the notion of almost strong
consistency from shared registers to snapshot objects.

APPENDIX

A. CALCULATIONS OF $P(E_{N-1,m})$ IN SECTION 4.1

In this section, we compute the probability of the event,
denoted $E_{N-1,m}$, that there are in total m read operations
in $N - 1$ queues (besides $Q_0$) which finish during the time
period $[w_{st}, r_{st}]$ (Step 3 in Section 4.1).

We first consider a single queue. Let D be a random
variable denoting the number of operations in one particular
queue which finish during the time period $r_{st} - w_{st}$. Its
probability distribution P(D = d) is given in Appendix A.1.
Then, we take into account all the $N - 1$ ($N > 1$) queues,
besides $Q_0$. The calculations of $P(E_{N-1,m})$ are given in
Appendix A.2.

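A.1 Calculations of P(D = d)

Let D be a random variable denoting the number of
operations in one particular queue which finish during the time
period $L = r_{st} - w_{st}$. To compute its probability
distribution, we condition on whether w sees this queue as empty
(denoted as an event $E_{\emptyset}$) or not (denoted as an event
$E_{\ne\emptyset}$).

1) If it sees this queue empty (with probability $\frac{\mu}{\lambda+\mu}$),
then the number of departures, during the time period
$r_{st} - w_{st}$, has the conditional distribution:

$$
P(D = d \mid E_{\emptyset}) =
\begin{cases}
P(L < A_0 + S_0) & \text{if } d = 0 \\[1ex]
P\left(\sum_{i=1}^{d}(A_i + S_i) \le L < \sum_{i=1}^{d+1}(A_i + S_i)\right) & \text{if } d \ge 1
\end{cases}
=
\begin{cases}
\dfrac{2\lambda + \mu}{2(\lambda + \mu)} & \text{if } d = 0 \\[1ex]
\left(1 - \dfrac{1}{2}\dfrac{\mu}{\mu+\lambda}\right)\left(\dfrac{1}{2}\dfrac{\mu}{\mu+\lambda}\right)^{d} & \text{if } d \ge 1
\end{cases}
$$

where $A_i$ are independent and identically distributed (iid)
exponential random variables with parameter $\lambda$ corresponding
to the inter-arrival times of operations in the other queue,
and $S_i$ are iid exponential random variables with parameter
$\mu$ corresponding to the service time of these operations.

Here we briefly demonstrate the calculation of

$$P\left(\sum_{i=1}^{d}(A_i + S_i) \le L < \sum_{i=1}^{d+1}(A_i + S_i)\right) \quad (\text{when } d \ge 1).$$

For convenience, we write

$$R_d \triangleq \sum_{i=1}^{d}(A_i + S_i) \quad \text{and} \quad R_{d+1} \triangleq \sum_{i=1}^{d+1}(A_i + S_i).$$

As L is an exponential random variable with parameter $\lambda$
and is independent of $R_d$, we have

$$P(R_d < L) = \int P(L > x)\, dP_{R_d}(x) = E(e^{-\lambda R_d}).$$

It follows from the independence assumptions that

$$
\begin{aligned}
P(R_d \le L < R_{d+1}) &= E\left(e^{-\lambda R_d} - e^{-\lambda R_{d+1}}\right) \\
&= \prod_{i=1}^{d} E\left(e^{-\lambda(A_i + S_i)}\right) - \prod_{i=1}^{d+1} E\left(e^{-\lambda(A_i + S_i)}\right) \\
&= \left(E\left(e^{-\lambda(A_1 + S_1)}\right)\right)^{d} - \left(E\left(e^{-\lambda(A_1 + S_1)}\right)\right)^{d+1} \\
&= \left(\frac{1}{2}\frac{\mu}{\mu+\lambda}\right)^{d} - \left(\frac{1}{2}\frac{\mu}{\mu+\lambda}\right)^{d+1} \\
&= \left(1 - \frac{1}{2}\frac{\mu}{\mu+\lambda}\right)\left(\frac{1}{2}\frac{\mu}{\mu+\lambda}\right)^{d}.
\end{aligned}
$$

2) Similarly, if it sees this queue full (with probability
$\frac{\lambda}{\lambda+\mu}$), we have

$$
P(D = d \mid E_{\ne\emptyset}) =
\begin{cases}
P(L < S_0) & \text{if } d = 0 \\[1ex]
P\left(\sum_{i=1}^{d} S_i + \sum_{i=1}^{d-1} A_i \le L < \sum_{i=1}^{d+1} S_i + \sum_{i=1}^{d} A_i\right) & \text{if } d \ge 1
\end{cases}
=
\begin{cases}
\dfrac{\lambda}{\mu + \lambda} & \text{if } d = 0 \\[1ex]
\dfrac{2\lambda+\mu}{\lambda+\mu}\left(\dfrac{1}{2}\dfrac{\mu}{\mu+\lambda}\right)^{d} & \text{if } d \ge 1
\end{cases}
$$

Using the law of total probability, we obtain

$$
\begin{aligned}
P(D = d) &= P(E_{\emptyset})\, P(D = d \mid E_{\emptyset}) + P(E_{\ne\emptyset})\, P(D = d \mid E_{\ne\emptyset}) \\
&=
\begin{cases}
\dfrac{1}{2}\left(1 + \left(\dfrac{\lambda}{\lambda+\mu}\right)^{2}\right) & \text{if } d = 0 \\[1ex]
\dfrac{(2\lambda+\mu)^{2}}{2(\mu+\lambda)^{2}}\left(\dfrac{1}{2}\dfrac{\mu}{\mu+\lambda}\right)^{d} & \text{if } d \ge 1
\end{cases}
\qquad (A.1)
\end{aligned}
$$

A.2 Calculations of $P(E_{N-1,m})$

Taking into account all the $N - 1$ ($N > 1$) queues, besides
$Q_0$, we can compute the probability of the event, denoted
$E_{N-1,m}$, that there are exactly m read operations which
finish during $L = r_{st} - w_{st}$ by modeling it as a
balls-into-bins problem.

There are $N - 1$ bins, labeled with $1, 2, \ldots, N - 1$. Let
$X_i$ be a random variable denoting the number of balls
contained in the i-th bin. The random variables $X_i$ are
independent and identically distributed, with the same
probability distribution

$$
p_x = P(X_i = x) =
\begin{cases}
\dfrac{1}{2}\left(1 + \left(\dfrac{\lambda}{\lambda+\mu}\right)^{2}\right) & \text{if } x = 0 \\[1ex]
\dfrac{(2\lambda+\mu)^{2}}{2(\mu+\lambda)^{2}}\left(\dfrac{1}{2}\dfrac{\mu}{\mu+\lambda}\right)^{x} & \text{if } x \ge 1
\end{cases}
$$

We want to compute the probability of the event, denoted
$E_{N-1,m}$, that there are in total m balls in these $N - 1$ bins.
For convenience, we write

$$r \triangleq \frac{(2\lambda+\mu)^{2}}{2(\mu+\lambda)^{2}} \quad \text{and} \quad s \triangleq \frac{1}{2}\frac{\mu}{\mu+\lambda},$$

so that $p_x = r s^{x}$ for $x \ge 1$.

First assume $m > 0$. Let K be a random variable denoting
the number of empty bins. Suppose there are k
($0 \le k \le N - 2$) empty bins (i.e., K = k). In this case,
we are partitioning integer m into a sum of $N - 1$ integers
such that k of them are 0 and $N - 1 - k$ of them are positive.
There are $\binom{N-1}{k}\binom{m-1}{N-k-2}$ ways of partitions. For each
partition

$$m = m_1 + m_2 + \cdots + m_k + m_{k+1} + m_{k+2} + \cdots + m_{N-1}$$

such that $m_i = 0$ for $1 \le i \le k$ and $m_i > 0$ for
$k + 1 \le i \le N - 1$, the probability that the i-th bin contains
$m_i$ balls (for every i) is

$$p_0^{k} \cdot (r s^{m_{k+1}})(r s^{m_{k+2}}) \cdots (r s^{m_{N-1}}) = p_0^{k} \cdot r^{N-k-1} s^{m}.$$

Therefore, the probability that there exist k ($0 \le k \le N - 2$)
empty bins is

$$P(E_{N-1,m}, K = k) = \binom{N-1}{k}\binom{m-1}{N-k-2}\, p_0^{k}\, r^{N-k-1}\, s^{m}.$$

Summing over all k yields (recall that $m > 0$)

$$P(E_{N-1,m}) = \sum_{k=0}^{N-2} P(E_{N-1,m}, K = k) = \sum_{k=0}^{N-2} \binom{N-1}{k}\binom{m-1}{N-k-2}\, p_0^{k}\, r^{N-k-1}\, s^{m}.$$

For the special case m = 0, we have

$$P(E_{N-1,0}) = p_0^{N-1}.$$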
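The balls-into-bins sum above is easy to check numerically. The Python sketch below is one possible transcription, assuming $p_0 = \frac{1}{2}(1 + (\lambda/(\lambda+\mu))^2)$, $r = (2\lambda+\mu)^2 / (2(\lambda+\mu)^2)$, and $s = \mu / (2(\lambda+\mu))$; this reading of the single-queue distribution is an assumption, chosen because it sums to 1 and reproduces the P{CP} column of Table 3. The function names are hypothetical.

```python
from math import comb


def single_queue_params(lam, mu):
    """Return (p0, r, s) so that P(X = 0) = p0 and P(X = x) = r * s**x for x >= 1."""
    p0 = 0.5 * (1.0 + (lam / (lam + mu)) ** 2)
    r = (2 * lam + mu) ** 2 / (2 * (lam + mu) ** 2)
    s = mu / (2 * (lam + mu))
    return p0, r, s


def prob_m_reads(N, m, lam, mu):
    """P(E_{N-1,m}): exactly m balls in total across the N - 1 bins."""
    p0, r, s = single_queue_params(lam, mu)
    if m == 0:
        return p0 ** (N - 1)
    # k runs over the possible numbers of empty bins, 0 .. N - 2.
    return sum(comb(N - 1, k) * comb(m - 1, N - k - 2)
               * p0 ** k * r ** (N - k - 1) * s ** m
               for k in range(N - 1))


def prob_cp_truncated(N, lam, mu):
    """Sum of P(E_{N-1,m}) over m = 1 .. N - 1, as used for Table 3."""
    return sum(prob_m_reads(N, m, lam, mu) for m in range(1, N))
```

With λ = μ = 10, `prob_cp_truncated` returns 0.28125, about 0.518555, and about 0.677307 for N = 2, 3, 4, in agreement with Table 3. Note that this truncated sum is smaller than $1 - p_0^{N-1}$, which also counts m ≥ N.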


B. CALCULATIONS OF $P\{r = R(w')\}$ IN SECTION 4.2

In this section, we compute the probability of r reading
from w' (i.e., $P\{r = R(w')\}$). According to Equation (4.4),
we shall compute both $P\{r \ne R(w)\}$ (see Appendix B.1)
and $P\{r' \ne R(w) \mid r \ne R(w)\}$ (see Appendix B.3). For the
latter probability, we also introduce a slightly generalized
timed balls-into-bins model in Appendix B.2.

B.1 Calculations of $P\{r \ne R(w)\}$ in the timed balls-into-bins model

Let $q = \lfloor n/2 \rfloor + 1$. Denote the delay times for each
ball from robot $R_1$ (corresponding to w) sent to each bin
$B_i$ by $D'_i$ and the delay times for each ball from robot
$R_2$ (corresponding to r) sent to each bin $B_i$ by $D_i$. Let
$M_m = \max\{D_1, D_2, \ldots, D_m\}$. By symmetry,

$$P(E) = \binom{n}{q} P(E, B = \{1, \ldots, q\}).$$

If $D_1 = M_q$, we shall compute

$$I_1 = P\{D'_1 > t + M_q,\ D'_2 > t + D_2,\ \ldots,\ D'_q > t + D_q,\ D_{q+1} > M_q,\ D_{q+2} > M_q,\ \ldots,\ D_n > M_q\}.$$

Conditioning on $M_q = D_1$, $D_2, \ldots,$ and $D_q$ and using the
independence assumptions, we obtain:

$$I_1 = \int_0^{\infty} \int_V e^{-\lambda_w (t + s)} \prod_{i=2}^{q} e^{-\lambda_w (t + x_i)}\; e^{-\lambda_r (n - q) s}\; f(s, x_2, \ldots, x_q)\; dx_2 \cdots dx_q\, ds,$$

where

$$V = [0, s]^{q-1} \subset \mathbb{R}^{q-1},$$

and

$$f(s, x_2, \ldots, x_q) = \mathbf{1}_{[0,\infty)}(s)\, \lambda_r e^{-\lambda_r s} \prod_{i=2}^{q} \lambda_r e^{-\lambda_r x_i}\, \mathbf{1}_{[0,s]}(x_i).$$

Here, $\mathbf{1}_{[0,\infty)}(s)$ and $\mathbf{1}_{[0,s]}(x_i)$ are indicator functions.
The integral over $x_i$ is:

$$\int_0^{s} e^{-\lambda_w (t + x_i)}\, \lambda_r e^{-\lambda_r x_i}\, dx_i = e^{-\lambda_w t}\, \lambda_r\, \frac{1 - e^{-(\lambda_w + \lambda_r) s}}{\lambda_w + \lambda_r}.$$

By independence of all $x_i$'s, we carry out all the $x_i$ integrals
and obtain

$$I_1 = e^{-q \lambda_w t}\, \lambda_r^{q} \int_0^{\infty} e^{-(\lambda_w + \lambda_r) s} \left(\frac{1 - e^{-(\lambda_w + \lambda_r) s}}{\lambda_w + \lambda_r}\right)^{q-1} e^{-\lambda_r (n - q) s}\, ds.$$

Making the substitution $y = 1 - e^{-(\lambda_w + \lambda_r) s}$ yields

$$I_1 = e^{-q \lambda_w t}\, \alpha^{q}\, B(q, \alpha(n - q) + 1),$$

where $\alpha = \frac{\lambda_r}{\lambda_r + \lambda_w}$ and B denotes the Beta function.

Finally, by symmetry, the cases $D_2 = M_q, \ldots,$ and $D_q = M_q$
give the same result, so that

$$P(E) = \binom{n}{q}\, q\, e^{-q \lambda_w t}\, \alpha^{q}\, B(q, \alpha(n - q) + 1) = \frac{e^{-q \lambda_w t}\, \alpha^{q}\, B(q, \alpha(n - q) + 1)}{B(q, n - q + 1)}.$$

B.2 Generalized timed balls-into-bins model for the case of $r' \ne R(w)$ conditioning on $r \ne R(w)$

Given $r \ne R(w)$ and $r' \prec r$, some messages from w are
known to reach the replicas later than the time r' has
collected enough acknowledgments and finished. To calculate
$P\{r' \ne R(w) \mid r \ne R(w)\}$, we introduce a slightly
generalized timed balls-into-bins model. In the generalized model,
at time t, robot $R_2$ picks p (0 < p < n) bins uniformly at
random (without replacement) and sends a ball to each of
them, instead of sending a ball to each of the n bins as
before. The remaining (n - p) unsent balls are used to model
the messages that arrive late.

For the case of $\{r' \neq R(w) \mid r \neq R(w)\}$, we consider the generalized model in which robots $R_1$ and $R_2$ represent $r'$ and $w$, respectively. We assume that the random variable $D_r$ (resp. $D_w$) for time delay is exponentially distributed with rate $\lambda_r$ (resp. $\lambda_w$). It remains to calculate the expected time lag between the events that $r'$ and $w$ are issued, i.e., $E\{w_{st} - r'_{st}\}$. This is challenging because there may be more than one such $r'$ following the concurrency pattern (Definition 4) in a single process. Nevertheless, the probability that there are $k$ ($k > 1$) such $r'$s in a single process decreases exponentially in $k$, according to Equation (A.1) in Appendix A.1. Therefore, we focus on the simple case that there is at most one $r'$ in a single process. In this situation, the calculation presented shortly yields $E\{w_{st} - r'_{st}\} = \frac{2\lambda - \mu}{2\lambda\mu}$. Finally, we are interested in the time point $t'$ when exactly $q = \lfloor n/2 \rfloor + 1$ of the $n$ bins have received the balls from $R_1$ (i.e., $r'$), and denote the set of these $q$ bins by $B$. In terms of the generalized timed balls-into-bins model, the case of $\{r' \neq R(w) \mid r \neq R(w)\}$ corresponds to the event $E'$ that none of the $q$ bins in $B$ receives a ball from $R_2$ (i.e., $w$) before it receives a ball from $R_1$ (i.e., $r'$).
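The event $E'$ can also be estimated by direct simulation of the generalized model. The following is a minimal Monte Carlo sketch under stated assumptions: `simulate_e_prime`, its parameters, and the seed are hypothetical names introduced here; $R_1$ is taken to send at time 0 and $R_2$ at time `t_lag`, with independent exponential delays as assumed in the text.

```python
import random

def simulate_e_prime(n, p, lam_r, lam_w, t_lag, trials=100_000, seed=7):
    """Estimate P(E') in the generalized timed balls-into-bins model.

    R1 (the read r') sends a ball to every bin at time 0 with Exp(lam_r) delays.
    R2 (the write w) sends balls at time t_lag to p bins chosen uniformly
    without replacement, with Exp(lam_w) delays.
    E' holds when no bin among the first q bins reached by R1 gets R2's ball
    before R1's ball.
    """
    rng = random.Random(seed)
    q = n // 2 + 1
    hits = 0
    for _ in range(trials):
        d_r = [rng.expovariate(lam_r) for _ in range(n)]
        # B: the q bins that receive R1's ball earliest.
        first_q = sorted(range(n), key=d_r.__getitem__)[:q]
        # Bins that R2 actually sends a ball to.
        sent = set(rng.sample(range(n), p))
        ok = all(
            b not in sent or t_lag + rng.expovariate(lam_w) > d_r[b]
            for b in first_q
        )
        hits += ok
    return hits / trials

# Hypothetical usage: n = 3 replicas, so q = 2 and p = n - q = 1.
estimate = simulate_e_prime(3, 1, lam_r=1.0, lam_w=1.0, t_lag=0.0)
print(estimate)
```

With `p = 0` no ball from $R_2$ is ever sent, so the estimate is exactly 1, which mirrors the trivial case where every message from $w$ arrives late.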

We now calculate the expected time lag between the events that $r'$ and $w$ are issued, i.e., $E\{w_{st} - r'_{st}\}$. To this end, we first calculate the expected duration of the interval $[r'_{ft}, r_{st}]$. Since $r'$ is required to finish within the interval $L = [w_{st}, r_{st}]$, whose length follows an exponential distribution with rate $\lambda$, and the inter-arrival time (between $r'$ and $r$ here), denoted $I$, also follows an exponential distribution with rate $\lambda$, we have

$$E\{r_{st} - r'_{ft}\} = E\{I \mid I < L\} = \frac{1}{2\lambda}.$$

Thus, the expected time lag between the events that $r'$ and $w$ are issued is

$$E\{w_{st} - r'_{st}\} = E\{r_{st} - r'_{ft}\} + E\{r'_{ft} - r'_{st}\} - E\{r_{st} - w_{st}\} = \frac{1}{2\lambda} + \frac{1}{\mu} - \frac{1}{\lambda} = \frac{2\lambda - \mu}{2\lambda\mu}. \tag{B.1}$$
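The conditional expectation $E\{I \mid I < L\} = 1/(2\lambda)$ follows because, for two independent exponentials with the same rate, the minimum is exponential with rate $2\lambda$ and is independent of which variable attains it. The sketch below checks this by simulation and evaluates the three-term sum in Equation (B.1); the function names, sample size, and parameter values are assumptions made for illustration.

```python
import random

def conditional_mean_gap(lam, samples=200_000, seed=1):
    """Monte Carlo estimate of E{I | I < L} for independent I, L ~ Exp(lam)."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(samples):
        i = rng.expovariate(lam)
        l = rng.expovariate(lam)
        if i < l:
            total += i
            count += 1
    return total / count

def expected_lag(lam, mu):
    """E{w_st - r'_st} as the three-term sum 1/(2 lam) + 1/mu - 1/lam."""
    return 1 / (2 * lam) + 1 / mu - 1 / lam

# Hypothetical rates: lam = 2 (arrival), mu = 3 (service).
lam, mu = 2.0, 3.0
print(conditional_mean_gap(lam), 1 / (2 * lam))
print(expected_lag(lam, mu), (2 * lam - mu) / (2 * lam * mu))
```

Note that the lag is negative whenever $\mu > 2\lambda$, i.e., in that regime $w$ is expected to be issued before $r'$.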

B.3 Calculations of $P\{r' \neq R(w) \mid r \neq R(w)\}$

Let $q = \lfloor n/2 \rfloor + 1$. Denote the delay times for each ball from robot $R_1$ (i.e., $r'$) sent to each bin $B_i$ by $D'_i$ and the delay times for each ball from robot $R_2$ (i.e., $w$) sent to each bin $B_i$ by $D_i$. Let $M_q = \max\{D'_1, D'_2, \ldots, D'_q\}$. By symmetry,

$$P(E') = \binom{n}{q} P(E', B = \{1, \ldots, q\}).$$


Given $r \neq R(w)$ and $r' \prec r$, we know that $q$ balls from $w$ are bound to reach the replicas later than the time $t'$ of interest. The other $(n - q)$ balls (corresponding to the parameter $p$ in the generalized model) are randomly and uniformly sent into $(n - q)$ replicas, one ball per bin. We denote this set of $(n - q)$ replicas by $B'$.

The case $n = 2$ is trivial: since $q = n = 2$, these two balls from $w$ are bound to reach the replicas later than the time $r'$ has collected enough acknowledgments and returned. Therefore, $P\{r' \neq R(w) \mid r \neq R(w)\} = 1$.

Now we consider $n > 2$. Assume $M_q = D'_1$ (without loss of generality, the corresponding bin for $D'_1$ is denoted by $b_1$; hence $b_1 \in B$) and let $k = |B \cap B'|$ ($0 \le k \le n - q$). We distinguish the case $b_1 \in B'$ from the case $b_1 \notin B'$. Thus, we shall compute

$$\begin{aligned}
J_1 = \sum_{k=0}^{n-q} \Big( & P\{D_1 > M_q - t',\ D_2 > D'_2 - t',\ \ldots,\ D_k > D'_k - t', \\
& \quad D'_{q+1} > M_q,\ D'_{q+2} > M_q,\ \ldots,\ D'_n > M_q\} \\
& + P\{D_2 > D'_2 - t',\ \ldots,\ D_{k+1} > D'_{k+1} - t', \\
& \quad D'_{q+1} > M_q,\ D'_{q+2} > M_q,\ \ldots,\ D'_n > M_q\} \Big),
\end{aligned}$$

where $t' = E\{w_{st} - r'_{st}\} = \frac{2\lambda - \mu}{2\lambda\mu}$ (see Equation (B.1) in Appendix B.2).

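Conditioning on $M_q = D'_1$, $D'_2, \ldots,$ and $D'_q$ and using the independence assumptions, we obtain:

$$\begin{aligned}
J_1 = {} & \sum_{k=0}^{n-q} \frac{\binom{q-1}{k-1}\binom{n-q}{n-q-k}}{\binom{n}{n-q}} \int_0^\infty \int \cdots \int_V \left( e^{\lambda_w (t'-s)\mathbf{1}_{s>t'}(s)} \prod_{i=2}^{k} e^{\lambda_w (t'-x'_i)\mathbf{1}_{x'_i>t'}(x'_i)} \right) e^{-\lambda_r (n-q)s} \\
& \qquad \cdot f(s, x'_2, \ldots, x'_q)\, dx'_2 \cdots dx'_q\, ds \\
& + \sum_{k=0}^{n-q} \frac{\binom{q-1}{k}\binom{n-q}{n-q-k}}{\binom{n}{n-q}} \int_0^\infty \int \cdots \int_V \left( \prod_{i=2}^{k+1} e^{\lambda_w (t'-x'_i)\mathbf{1}_{x'_i>t'}(x'_i)} \right) e^{-\lambda_r (n-q)s} \\
& \qquad \cdot f(s, x'_2, \ldots, x'_q)\, dx'_2 \cdots dx'_q\, ds,
\end{aligned} \tag{B.2}$$

where

$$V = [0, s]^{q-1}$$

and

$$f(s, x'_2, \ldots, x'_q) = \mathbf{1}_{[0,\infty)}(s)\, \lambda_r e^{-\lambda_r s} \prod_{i=2}^{q} \lambda_r e^{-\lambda_r x'_i}\, \mathbf{1}_{[0,s]}(x'_i).$$

Notice that $e^{\lambda_w (t'-s)\mathbf{1}_{s>t'}(s)}$ denotes a piecewise function with respect to $s$:

$$e^{\lambda_w (t'-s)\mathbf{1}_{s>t'}(s)} = \begin{cases} e^{\lambda_w (t'-s)} & \text{if } s > t', \\ 1 & \text{if } s \le t', \end{cases}$$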
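and, similarly,

$$e^{\lambda_w (t'-x'_i)\mathbf{1}_{x'_i>t'}(x'_i)} = \begin{cases} e^{\lambda_w (t'-x'_i)} & \text{if } x'_i > t', \\ 1 & \text{if } x'_i \le t'. \end{cases}$$

For convenience, we denote the first multiple integral in $J_1$ by $J_{11}$ and the second by $J_{12}$, and focus on the calculation of $J_{11}$ in the following. First of all, we evaluate the leftmost integral of $J_{11}$ over $s$ by breaking it into two parts:

$$J_{11} = \int_0^{t'} g(s)\, ds + \int_{t'}^{\infty} g(s)\, ds, \tag{B.3}$$

where $g(s)$ is the integrand in $J_{11}$ with respect to variable $s$.

In the first integral over $s \in [0, t']$, we have $x'_i \le s \le t'$ for $i = 2, 3, \ldots, q$. Thus it reduces to

$$\begin{aligned}
\int_0^{t'} g(s)\, ds &= \int_0^{t'} \int \cdots \int_V e^{-\lambda_r (n-q)s}\, \mathbf{1}_{[0,t']}(s)\, \lambda_r e^{-\lambda_r s} \prod_{i=2}^{q} \lambda_r e^{-\lambda_r x'_i}\, \mathbf{1}_{[0,s]}(x'_i)\, dx'_2 \cdots dx'_q\, ds \\
&= \lambda_r^q \int_0^{t'} e^{-\lambda_r (n-q+1)s} \int \cdots \int_V \prod_{i=2}^{q} e^{-\lambda_r x'_i}\, dx'_2 \cdots dx'_q\, ds \\
&= \lambda_r \int_0^{t'} e^{-\lambda_r (n-q+1)s} \left(1 - e^{-\lambda_r s}\right)^{q-1} ds.
\end{aligned} \tag{B.4}$$

The second integral over $s \in [t', \infty)$ reduces to

$$\begin{aligned}
\int_{t'}^{\infty} g(s)\, ds = \int_{t'}^{\infty} \int \cdots \int_V & \left( e^{\lambda_w (t'-s)} \prod_{i=2}^{k} e^{\lambda_w (t'-x'_i)\mathbf{1}_{x'_i>t'}(x'_i)} \right) \mathbf{1}_{[t',\infty)}(s)\, \lambda_r e^{-\lambda_r s} \\
& \cdot \prod_{i=2}^{q} \lambda_r e^{-\lambda_r x'_i}\, \mathbf{1}_{[0,s]}(x'_i)\; e^{-\lambda_r (n-q)s}\, dx'_2 \cdots dx'_q\, ds.
\end{aligned}$$

Each of the $k - 1$ integrals over $x'_j$ ($j = 2, 3, \ldots, k$) is

$$\begin{aligned}
\int_0^{s} e^{\lambda_w (t'-x'_j)\mathbf{1}_{x'_j>t'}(x'_j)}\, e^{-\lambda_r x'_j}\, dx'_j &= \int_0^{t'} e^{-\lambda_r x'_j}\, dx'_j + \int_{t'}^{s} e^{\lambda_w (t'-x'_j)}\, e^{-\lambda_r x'_j}\, dx'_j \\
&= \frac{1 - e^{-\lambda_r t'}}{\lambda_r} + e^{\lambda_w t'} \cdot \frac{e^{-(\lambda_w+\lambda_r)t'} - e^{-(\lambda_w+\lambda_r)s}}{\lambda_w + \lambda_r},
\end{aligned}$$

while each of the remaining $(q - k)$ integrals over $x'_i$ ($i = k+1, k+2, \ldots, q$) is

$$\int_0^{s} e^{-\lambda_r x'_i}\, dx'_i = \frac{1 - e^{-\lambda_r s}}{\lambda_r}.$$

Carrying out all the $x'_i$ integrals, we obtain

$$\int_{t'}^{\infty} g(s)\, ds = \lambda_r^q e^{\lambda_w t'} \int_{t'}^{\infty} e^{-(\lambda_w+\lambda_r)s} \left( \frac{1 - e^{-\lambda_r t'}}{\lambda_r} + e^{\lambda_w t'} \cdot \frac{e^{-(\lambda_w+\lambda_r)t'} - e^{-(\lambda_w+\lambda_r)s}}{\lambda_w + \lambda_r} \right)^{k-1} \left( \frac{1 - e^{-\lambda_r s}}{\lambda_r} \right)^{q-k} e^{-\lambda_r (n-q)s}\, ds. \tag{B.5}$$

Substituting Equations (B.4) and (B.5) into Equation (B.3) yields

$$\begin{aligned}
J_{11} = {} & \lambda_r \int_0^{t'} e^{-\lambda_r (n-q+1)s} \left(1 - e^{-\lambda_r s}\right)^{q-1} ds \\
& + \lambda_r^q e^{\lambda_w t'} \int_{t'}^{\infty} e^{-(\lambda_w+\lambda_r)s} \left( \frac{1 - e^{-\lambda_r t'}}{\lambda_r} + e^{\lambda_w t'} \cdot \frac{e^{-(\lambda_w+\lambda_r)t'} - e^{-(\lambda_w+\lambda_r)s}}{\lambda_w + \lambda_r} \right)^{k-1} \left( \frac{1 - e^{-\lambda_r s}}{\lambda_r} \right)^{q-k} e^{-\lambda_r (n-q)s}\, ds.
\end{aligned} \tag{B.6}$$

In the same way, we have

$$\begin{aligned}
J_{12} = {} & \lambda_r \int_0^{t'} e^{-\lambda_r (n-q+1)s} \left(1 - e^{-\lambda_r s}\right)^{q-1} ds \\
& + \lambda_r^q \int_{t'}^{\infty} e^{-\lambda_r s} \left( \frac{1 - e^{-\lambda_r t'}}{\lambda_r} + e^{\lambda_w t'} \cdot \frac{e^{-(\lambda_w+\lambda_r)t'} - e^{-(\lambda_w+\lambda_r)s}}{\lambda_w + \lambda_r} \right)^{k} \left( \frac{1 - e^{-\lambda_r s}}{\lambda_r} \right)^{q-1-k} e^{-\lambda_r (n-q)s}\, ds.
\end{aligned} \tag{B.7}$$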


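Substituting Equations (B.6) and (B.7) into Equation (B.2) yields

$$\begin{aligned}
J_1 = {} & \sum_{k=0}^{n-q} \frac{\binom{q-1}{k-1}\binom{n-q}{n-q-k}}{\binom{n}{n-q}}\, J_{11} + \sum_{k=0}^{n-q} \frac{\binom{q-1}{k}\binom{n-q}{n-q-k}}{\binom{n}{n-q}}\, J_{12} \\
= {} & \lambda_r \int_0^{t'} e^{-\lambda_r (n-q+1)s} \left(1 - e^{-\lambda_r s}\right)^{q-1} ds \\
& + \sum_{k=0}^{n-q} \frac{\binom{q-1}{k-1}\binom{n-q}{n-q-k}}{\binom{n}{n-q}}\, \lambda_r^q e^{\lambda_w t'} \int_{t'}^{\infty} e^{-(\lambda_w+\lambda_r)s} \left( \frac{1 - e^{-\lambda_r t'}}{\lambda_r} + e^{\lambda_w t'} \cdot \frac{e^{-(\lambda_w+\lambda_r)t'} - e^{-(\lambda_w+\lambda_r)s}}{\lambda_w + \lambda_r} \right)^{k-1} \left( \frac{1 - e^{-\lambda_r s}}{\lambda_r} \right)^{q-k} e^{-\lambda_r (n-q)s}\, ds \\
& + \sum_{k=0}^{n-q} \frac{\binom{q-1}{k}\binom{n-q}{n-q-k}}{\binom{n}{n-q}}\, \lambda_r^q \int_{t'}^{\infty} e^{-\lambda_r s} \left( \frac{1 - e^{-\lambda_r t'}}{\lambda_r} + e^{\lambda_w t'} \cdot \frac{e^{-(\lambda_w+\lambda_r)t'} - e^{-(\lambda_w+\lambda_r)s}}{\lambda_w + \lambda_r} \right)^{k} \left( \frac{1 - e^{-\lambda_r s}}{\lambda_r} \right)^{q-1-k} e^{-\lambda_r (n-q)s}\, ds.
\end{aligned} \tag{B.8}$$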
Finally, by symmetry, the cases $D'_2 = M_q, \ldots,$ and $D'_q = M_q$ give the same result, so that

$$P(E') = \begin{cases} q \dbinom{n}{q} J_1 = \dfrac{J_1}{B(q, n-q+1)} & \text{if } n > 2, \\[2ex] 1 & \text{if } n = 2. \end{cases}$$
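The final expression has no simple closed form, but it is straightforward to evaluate numerically. The sketch below is one possible reading of Equation (B.8), evaluated with a composite Simpson rule; the function names, the truncation of the infinite integral at `tail / lam_r`, and the restriction to a non-negative lag (`t_lag >= 0`) are assumptions made here rather than details given in the paper.

```python
import math

def simpson(f, a, b, n=4000):
    """Composite Simpson rule on [a, b] with an even number n of panels."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def beta(a, b):
    """Beta function B(a, b), computed through log-gamma."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def prob_e_prime(n, lam_r, lam_w, t_lag, tail=40.0):
    """P(E') = J_1 / B(q, n-q+1), with J_1 evaluated term by term.

    Assumes t_lag >= 0 and truncates the integrals over [t_lag, inf).
    """
    if n == 2:
        return 1.0
    q = n // 2 + 1
    m = n - q
    lam = lam_w + lam_r

    def a_term(s):  # x'-integral for a bin that also receives a ball from w
        return ((1 - math.exp(-lam_r * t_lag)) / lam_r
                + math.exp(lam_w * t_lag)
                * (math.exp(-lam * t_lag) - math.exp(-lam * s)) / lam)

    def c_term(s):  # x'-integral for a bin whose ball from w is late
        return (1 - math.exp(-lam_r * s)) / lam_r

    upper = t_lag + tail / lam_r
    # Part with s <= t_lag, common to both sums (the weights add up to 1).
    j1 = lam_r * simpson(
        lambda s: math.exp(-lam_r * (m + 1) * s)
        * (1 - math.exp(-lam_r * s)) ** (q - 1), 0.0, t_lag)
    for k in range(m + 1):
        w = math.comb(m, m - k) / math.comb(n, m)
        if k >= 1:  # b_1 in B'
            j1 += (math.comb(q - 1, k - 1) * w * lam_r ** q
                   * math.exp(lam_w * t_lag) * simpson(
                       lambda s: math.exp(-lam * s) * a_term(s) ** (k - 1)
                       * c_term(s) ** (q - k) * math.exp(-lam_r * m * s),
                       t_lag, upper))
        # b_1 not in B'
        j1 += (math.comb(q - 1, k) * w * lam_r ** q * simpson(
            lambda s: math.exp(-lam_r * s) * a_term(s) ** k
            * c_term(s) ** (q - 1 - k) * math.exp(-lam_r * m * s),
            t_lag, upper))
    return j1 / beta(q, m + 1)

# Hypothetical parameters: three replicas, equal rates, zero lag.
print(prob_e_prime(3, lam_r=1.0, lam_w=1.0, t_lag=0.0))
```

As the lag grows, every ball from $w$ arrives after $r'$ has finished, so the value should approach 1; this gives a simple consistency check on the evaluation.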